Description
Discussing a way forward in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16115
A
So, today is August 18th, and we are discussing what a necessary stable counterpart for the Gitaly team looks like, and what kind of projects we can work on together, both from the Gitaly side and from ours.
A
I want to provide some context around this. In the past, the Scalability team has been the de facto stable counterpart for Gitaly, as Igor noted: when the Gitaly team needed to make configuration changes they weren't sure about, or when they needed to roll out something in production.
A
The Scalability team has been helping out there, which ended up resulting in the Reliability team having little knowledge of Gitaly. We don't have any in-house knowledge about Gitaly on the Reliability team, and sometimes we are on call for that service, or sometimes we need to roll out a change for that team because the Scalability team is too busy or something. We don't really have a go-to person for that, and the managers always end up rolling the dice on it.
A
So it's right there in the name of the company, and I think that's one of the big drivers for the stable counterpart. We've had stable counterparts in the past: for the Runner team, and Igor was one of those last year, and we also had a stable counterpart for the Fulfillment team for the customers migration.
A
Any questions so far on that? Awesome. So I wanted to set up this meeting because we had some async discussions on the issue about proposals for what we can work on, and things like that. But before going through that, I wanted to ask: are we creating a solution for a problem that does not exist? Do we actually need a stable counterpart? In the sense: does the Gitaly team feel like there are problems that the infrastructure team is not addressing, or has little bandwidth to address? Maybe you're the expert here on this part, Andres.
B
I think the main issue we could solve, which is sort of a lingering time bomb, is the lack of specific expertise in Gitaly on the infrastructure team. This affects incident response in particular, but over time I think it also affects the infrastructure design changes you make, which might turn out not to be a good fit for Gitaly simply because we couldn't articulate it well and you didn't have the necessary insight to take it into account.
A
This would be an iteration of having stable counterparts. We've tried this before, but with a lot fewer people on the team and a lot less focus time as well; we only recently got to about 30 people on the Reliability team, and now we kind of have the bandwidth to do this. We didn't have that capacity before, and I think this is one of the first iterations toward it. So yeah, agreed there. Excellent.
B
Yes, we have operational things like production access and things like that, but I see even more value in design work, in co-designing things, especially with what we're doing now, changing the whole way the application works. Yeah.
B
At which point, whoever has expertise has a better chance of jumping into this. So if you had already built that, and some of you have it, we would get more value out of this. Yes.
A
Agreed, yeah. And we can also have a voice in the infrastructure design, because I saw a comment on a Praefect merge request the other day, a thread between Matt and Jakob. Matt provided a point of view from the infrastructure department, and Jakob said: oh, that makes complete sense, but I didn't have visibility into that. I think having that is actually what can solve a lot of problems there. So I agree with that.
A
So, you mentioned the re-architecture of Praefect, for example using the Raft protocol instead of Postgres. What's the progress there? Where are we at at that stage? Is it still in the RFC process? How is that going?
B
We had a long internal discussion about the details of it, and now it seems feasible; we don't think there will be major blockers. We also found that it actually solves a lot of problems that we have now. Sami is writing up what was a fairly fragmented discussion on an issue into a design doc, so that there can be a second round of review without having to piece the information together yourself.
B
And this is our Q3 goal. From then on, I'm hoping that this will be blessed by everybody who needs to bless it, and then we can go and implement. The plan includes the "how do we iterate towards results" and "how do we actually get there", because I guess that's as hard as the "how will it work".
A
Yeah, and will the RFC have a rollout proposal as well, like how we will deprecate the PostgreSQL database and things like that, or is that still out of scope?
B
I think we need both, yeah. We have two OKRs: one is to write the design doc, the other is to write a project plan. And yes, we will eventually need a detailed rollout plan.
B
The current hope is that it will mostly happen by osmosis. As we upgrade the software, the plan is to start using Raft: keep Postgres, keep everything running as it is, and just start using Raft in the background to also replicate the information. Once that works, we figure out how to switch over, and then remove all the Postgres parts. This would effectively turn all the single nodes into a single-node cluster situation, which aligns very well with our goal that there shouldn't be a distinction between cluster and non-cluster.
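The "migration by osmosis" described above (keep Postgres as the source of truth, shadow-replicate every write into a Raft-backed store, and cut over only once the copies agree) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Gitaly's actual code; all names here are hypothetical, and the backends are stand-in in-memory stores.

```python
class MemStore:
    """Stand-in for either backend; real code would talk to Postgres or Raft."""
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


class Coordinator:
    """Routes metadata writes during the migration window."""
    def __init__(self, shadow_raft=False):
        self.postgres = MemStore()   # current source of truth
        self.raft = MemStore()       # future source of truth
        self.shadow_raft = shadow_raft  # feature flag for background replication

    def write(self, key, value):
        self.postgres.put(key, value)
        if self.shadow_raft:
            # Shadow write; a failure here must never fail the client request.
            self.raft.put(key, value)

    def read(self, key):
        # Reads stay on Postgres until the switchover.
        return self.postgres.get(key)

    def shadow_complete(self):
        # Precondition for flipping reads over to Raft and removing Postgres.
        return self.raft.data == self.postgres.data


coord = Coordinator(shadow_raft=True)
coord.write("repo/1/generation", "42")
assert coord.read("repo/1/generation") == "42"
assert coord.shadow_complete()
```

The design choice mirrored here is that the flag only adds a second write path; nothing about the existing Postgres path changes until the shadow copy is verified complete.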
A
Okay, okay, yeah. And how can the... sorry, go ahead.
C
Sorry, I wanted to go back to the previous point quickly, and on a higher level, maybe give a little bit of input from the Scalability side. I think the collaboration between Gitaly and Scalability has been working pretty well; I'm pretty happy with that.
C
Where I see the bigger gap, though, is when it comes to scope. Scalability is really focused on, well, scalability and performance, right? So we've been heavily engaged in optimizing: we designed some caching-related logic, and we've been putting profiling tooling into place.
C
It's very much within that realm, but that leaves a lot of stuff out of scope. Where I see the biggest gaps is operational tooling; it's already hinted at a bit later in the agenda, so we don't have to go into the details right this second. But how do we manage our Gitaly fleet? How do we provision and rebalance the shards?
C
Those are questions that aren't really addressed by the Scalability team, and they're kind of outside the scope of the team's mission. So that's where I see a potential opportunity.
C
Yeah, I mean, I think the Scalability managers are a little more protective of the scope, so we're kind of punting more of that responsibility over to Reliability, which just pushes the problem elsewhere, but I think we need to address it somehow.
A
Yeah, I think that makes sense, and I agree. So, jumping a bit to the proposals for projects that the SRE team can work on: for example, the cgroups work. From what you just described, that really falls under the Scalability team, because that is more about constraining CPU usage and things like that.
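For context on what "constraining CPU usage" means here: under cgroup v2, a CPU limit is expressed as a quota/period pair written to the `cpu.max` file. The helper below is purely illustrative (it is not the rollout tooling being discussed) and shows how that value is computed:

```python
# Illustrative only: a cgroup v2 CPU limit is the string "<quota> <period>"
# in microseconds. Allowing a process group N CPUs of runtime with the
# default 100 ms period means quota = N * 100000.

def cpu_max(cpus: float, period_us: int = 100_000) -> str:
    """Return the cgroup v2 cpu.max value granting `cpus` CPUs of runtime."""
    quota = int(cpus * period_us)
    return f"{quota} {period_us}"


# Two full CPUs, and half a CPU:
assert cpu_max(2) == "200000 100000"
assert cpu_max(0.5) == "50000 100000"
# Provisioning tooling would write this string to a path like
# /sys/fs/cgroup/<group>/cpu.max (path shown for illustration only).
```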
A
Awesome. Okay, let's look at the time; okay, 15 minutes left. So I was going to jump to how the SRE team, the Reliability team specifically, can get more involved in the design doc for Gitaly Cluster. Is it a matter of just waiting for Sami to write everything down, and then we review that merge request, or is there anything else that we can do there?
A
Okay. And would it be completely okay for us to start looking at it, poking at it, and leaving some feedback if we find something?
A
Okay, so looking at the current projects that the SRE team can work on, for this quarter and the next one: it's pretty clear that support for Gitaly Cluster, the architecture direction, is the number one priority, since that's Gitaly's OKR, so you can kind of tie those two together. There's also the cgroups rollout that's ongoing, and I'm curious whether the Reliability team should pick it up.
A
I
know
rachel
raised
some
concerns
around
that
because
of
ramping
up
time
and
like
context,
switching
and
things
like
that,
I
I'm
curious
to
hear
andrew's
opinion
on
this.
Like
is
there
a
lot
of
work
still
left?
Is
there
like
a
big
amount
of
context
together
around
it?
Or
can
we
like
kind
of
pick
it
up
in
two
weeks
and
like
continue
the
work
with
the
guitar
team
to
get
that
ready.
B
It's been going in waves, sort of. I need to talk to John, who's been doing the actual work.
B
If we can improve on it, that would be awesome. Awesome.
A
Okay, so let's write down some action items in the agenda. The first action item, and this one is for me, is to write down the OKRs, or whatever the stable counterpart is going to be doing this quarter and next, for the Reliability team.
B
Before you do that, I have a meta question about the stable counterparts. I want to make sure that this is not in addition to all their regular responsibilities, because then they'd still have the full scope; at least some of that time needs to be dedicated to this for them to be effective.
A
Yeah, no. From what I understand, both me and Kalliope are going to be, quote unquote, full-time stable counterparts, in the sense that any issues that come up, like, for example, I don't know, cgroups and the Raft architecture, that's what we're going to start working on next week. That's the kind of commitment that we're thinking about, and also, even when we start discussing, I don't know, SSH access for Gitaly, yeah.
A
Yeah, I think we're going to get started with that next week, and that sounds like...
A
The number one priority for this quarter is to be Gitaly's stable counterpart, and if that means giving you SSH access, limited or in a safe way, then that would be our P1, along with helping out with the Raft design and the cgroups rollout, since the cgroups work has been happening in waves, kind of thing.
A
I've talked to Jarv about this this week, actually, and he wants to drive it, and that's completely fine, and yes, we can help out, even if it's just us being aware that costs are coming. And I know you folks talked about volume management, like why don't we have volume management, kind of thing?
A
And
things
like
that,
so
I
think
we're
aware
of
that.
We're
just
leaving
it
up
to
jar,
because
for
us
that
seems
more
of
a
cost
savings
measured
on
a
reliability
measure
that
makes
sense
yeah.
So
it.
A
That's a good point. Yes, especially if we're going to end up having downtime on each node, since that seems like the simplest way. So yes, that's a good callout. We can keep an eye on that as well and see if we can help out in any way, or at least be aware that things are happening there.
B
I don't know how much you know about Gitaly internals and how this works, or how much you want to know. What would you want to learn? Would you want to watch our new-team-member videos, or would...
A
I would go as far as to say: if we can join your weekly meetings, that would be perfect, because then we're at least aware of what's going on, even if it's just for five minutes. Okay, perfect, so we can join there, please.
B
Come along, I would be delighted to have you. I also hang out in the Gitaly lounge channel, yep. I can add you to a bunch of things. Yes.
A
Do
that
okay
and
I'll
write
down
and
I'll
pick
with
a
proposal
of
the
things
that
we're
gonna
work
on
for
qt
q3,
which
would
be
rough
designs,
c
groups
and
potentially
ssh
access?
Unless
we
can
talk
about
this
async
unless
we
feel
like
rebalancing,
is
more
of
a
problem
than
ssh
access.
But
we
can
talk
to
that.
A
Yes, if someone else says otherwise, then it's my fault, but from all the information I have, it's me and Kalliope who are going to work on it. So yeah, let's go ahead with that and roll with it.
A
I also want to open up a merge request in the handbook today as well, to make it official, so it becomes "this is what the handbook says" kind of thing. So yep, thanks.
C
So, Steve, when it comes to knowledge transfer, in particular on cgroups: I think there are a few folks on the Scalability team that do have expertise with this design, and I'd be happy to help facilitate some of that knowledge transfer.
C
Maybe you could connect with Rachel to see what the time allocation there is, and whether it makes sense to share that with the SRE team. I think it does make sense.
A
Yeah, I think it does, because at the end of the day, if we have to modify some cgroups, we'd have to know what and where to modify. So yeah, I agree with that. I'll ping Rachel on the epic to discuss an official handover, or whatever we need to be doing. So yeah, sounds good.
A
Awesome. So thank you, Igor and Kalliope, for keeping the notes in the agenda up to date; I appreciate that, because I'm terrible at doing that. So yeah, I think we have a clear action plan for now: just continue discussing async, especially on priorities. Is there anything else that's not clear for anyone?