Description
Part of https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/143
D: I didn't explain it; it was just reconciling the config in the pod versus the database [unclear].
D: [unclear] The biggest gap right now is the replicas for application load balancing. This is something that I don't think we anticipated: we use Consul, which does a DNS lookup to get the list of replicas for application load balancing, and we don't have that in Kubernetes.
D: We don't have a good story yet for how to do this. In our Sidekiq pod, maybe we can run Consul; we have a couple of options, and I tried to lay them out in a new issue: do we run a sidecar, do we just run a Consul agent in another pod and then make Consul available to the Sidekiq pod? We have different options that we really need to figure out, and whether this goes into the official chart or not, I don't know either.
D: In general, Patroni is the one registering itself as the Consul service that has all the replicas in the list. It's kind of custom to GitLab.com; other people could do this, but they don't necessarily do it. We do ship with Consul, so I don't know, but I think this pattern of using Consul DNS for load balancing is probably unique to us, or very few people are doing it.
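(As a rough, illustrative sketch of the lookup described above, not something from the meeting: a Sidekiq pod would need to resolve the replica list against a local Consul agent's DNS interface. The service name, nameserver address, and port below are assumptions.)

```python
# Sketch only: resolve database replicas the way the Rails load balancer
# would, by asking a local Consul agent's DNS interface. The service name,
# nameserver, and port are assumptions, not values from the meeting.
import dns.resolver  # dnspython

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["127.0.0.1"]  # Consul agent reachable from the pod
resolver.port = 8600                  # Consul's default DNS port

# Patroni registers the replicas under a Consul service; hypothetical name here.
answer = resolver.resolve("db-replica.service.consul", "A")
replicas = [record.address for record in answer]
print(replicas)
```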
A: I mean, I am more inclined to say we want to understand the impact of this one queue in Kubernetes, so my vote would go for something that is not permanent; we don't have to make a decision immediately and solve every problem now. If we go down the route of thinking about all the use cases that users could have, that's going to drag us out in a completely different direction.
D: Could be. I mean, if we really wanted to, we could also just go without load balancing for this queue, because this queue is probably just not making a whole lot of database calls. At least that's what I'm discovering during my load testing, and from what we're doing in production, we're just not doing exports at a high rate at all; it's very low, except when someone abuses us. But I think we'll just work that out; we're working it out in the issue.
D: Do you know what we're currently seeing in production for exports? In general, I just looked over a 24-hour period. We peak at 10 exports a second, but that's fairly unusual; it's more typically around 5 exports every minute, so it's really slow, and you don't see a lot of these. Of course, when we do get abused, it goes up quite a bit. I didn't realize that we already have rate limiting for exports specifically in the application, and this is not Rack Attack or anything.
D: It's actually just a rate limit of, I think, one export every five minutes per project, and I ran into it during load testing, because of course the first thing I wanted to try was to export a single project, say, a hundred times, and that didn't work, so I had to hack the code to make it work. But I think the first thing we're going to try here is just doing an export across ten different projects.
D: The project example that I'm using is Ansible, because I just looked for a medium-sized project and this one's around four hundred megabytes. I think it's a good example, maybe kind of on the larger side; using these very small projects is not a good reference. Unfortunately, on pre-prod, which is where I'm doing my testing, I couldn't use a very large project like gitlabhq or the Linux kernel.
D: I was having trouble, and I think it has to do with pre-prod [unclear]; it was just taking very long. So I think when we go to staging we're going to be able to do some better load testing with very large projects. Ansible itself takes about 20 seconds or so to export, so I think it's a good example.
D: So here on the lower right I'm just watching the pods. You can see that there's one export pod running; I'm tailing the logs and pulling out some interesting things from Sidekiq. This is also going into Stackdriver, but it's a bit nicer to see it in the terminal, so I have a for loop. What I'm doing here is I have ten different projects, all of them just imports of Ansible, and we're going to hit the API and export all of them right after each other.
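(A minimal stand-in for the loop described above, assuming the public project export API; the host, token, and project IDs are placeholders, not the values used in the demo.)

```python
# Sketch of the demo's loop: schedule an export for ten projects back to back.
# Host, token, and project IDs are placeholders.
import requests

GITLAB_URL = "https://pre.gitlab.example.com"
HEADERS = {"PRIVATE-TOKEN": "glpat-REDACTED"}
PROJECT_IDS = range(1, 11)  # the ten Ansible imports in the demo

for project_id in PROJECT_IDS:
    # POST /projects/:id/export queues an export job for Sidekiq to pick up.
    resp = requests.post(
        f"{GITLAB_URL}/api/v4/projects/{project_id}/export",
        headers=HEADERS,
        timeout=30,
    )
    # 202 means the job was accepted; the per-project rate limit mentioned
    # above shows up as an error response instead.
    print(project_id, resp.status_code)
```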
D: So you can see successful responses for all of them. Down here on the lower left, just to explain, this is the queue latency: the time from when the job goes into Redis to the time that Sidekiq picks it up. You can see it's very fast, and then the export completed in about 15 seconds. Now you see that the next one started, and now the queue latency is a bit higher because it was sitting in Redis for longer. We look over here and we can see that...
D: So you can see the CPU is going up, and the HPA is probably starting to think about scaling; you can see it's starting to bring up another pod now. This is one thing I think we're kind of aware of, but maybe without realizing how bad it is: how long it takes for these pods to boot. It's a minute and 40 seconds, which is terrible. The dependency init container alone takes 40 seconds.
D: So if we just make that a really simple check to make sure that the database schema is correct, then that shaves off 40 seconds, so now we're down to a minute. That still seems too long to me, but at least it's a little bit better. So now we have some more exports again; we just had two of them, but it'll probably start to spin up a third one, and you see the second one is still...
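(A hypothetical sketch of the "really simple check" on the database schema mentioned a couple of paragraphs up, in place of the full dependencies init container; the connection URL and minimum migration version are placeholders, and this is not the actual chart logic.)

```python
# Hypothetical lightweight init-container check: is the database reachable
# and migrated at least up to some known version? Placeholders throughout.
import os
import sys

import psycopg2

EXPECTED_MIN_VERSION = "20200101000000"  # placeholder migration timestamp

conn = psycopg2.connect(os.environ["DATABASE_URL"])
with conn, conn.cursor() as cur:
    cur.execute("SELECT max(version) FROM schema_migrations")
    (latest,) = cur.fetchone()

if latest is None or latest < EXPECTED_MIN_VERSION:
    print(f"database schema too old: {latest}", file=sys.stderr)
    sys.exit(1)
print(f"database schema ok: {latest}")
```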
D: If you notice here on the lower left, you can see that nothing is being done in parallel right now; we're just doing one at a time, so these latencies are creeping up, because they are the amount of time the job sits in the Redis queue. The execution time itself, of course, stays the same; it doesn't really change, but you still have to wait a long time just for the job to be picked up.
D: Exactly. In fact, I saw this when I tried to export gitlabhq: I imported that project from GitHub and then tried to do an export of it, and it just never finished; it was just blocking everything. In that case, with only one pod, the CPU doesn't climb, so we're stuck at one pod and we're blocking everything else. So that's no good.
A: So what if... I mean, we're starting here fairly low, with one pod, which is not even close to comparable to what we have in prod. We don't have to start with something low here; we can start with something relatively high, let's say 10 pods. That means ten exports can be handled at the same time, right?
E: Just to paint a bit of a picture around the usage pattern that we see for project export: I would say 90 percent, and I don't know the exact numbers, but in my head it's like 90% of all project exports happen between four o'clock in the morning UTC and ten minutes past four, right?
E: [That] is the total execution time for the priority Sidekiq queue, the estimated p95 queue time. Maybe it's not 90%, but every morning we get a really long queue, a backlog of export jobs, at 4 o'clock, and the reason for that is that those jobs are themselves kicked off by GitLab CI jobs that are daily and all run at 4 a.m., because that's when our daily jobs run. Yeah.
D: Okay, well, I think the biggest challenge we have right now is to figure out how we're going to scale there, or how many pods we need at a minimum to be able to service these exports. I was a bit surprised that when I did this very large export, it hung around for a pretty long time. Do you know, shouldn't there be a timeout for how long these jobs run? Is there no timeout?
D
Is
this
was
at
least
30
minutes,
if
not
an
hour,
and
the
reason
why
it's
because
it's
pre
proud
and
we
have
NFS
and
I-
think
it's
just
like
something
was
taking
a
very
long
time.
I
didn't
dig
into
it,
but
to
me
that
sounds
like
we
need.
We
need
to
timeout
right.
We
can't
wait
a
lot.
Wait
that
long.
It's.
E: It's one of those things where it's kind of a sunk cost, right, and people probably restart it over and over and over. Yeah, but I think more importantly, to the point, which is a totally different conversation: obviously project exports are a real mess at the moment. Today I found out about group exports, which is even more fun. Yeah.
E: We just match whatever we've got in production; that's fine. And I think the number one thing is we just have to be setting expectations that there will be queuing. There always has been queuing, and sometimes jobs will take like 15 minutes to schedule, and that's okay.
D: One last thing is that I tried to play with concurrency and read up a little bit about how Sidekiq handles concurrency. Supposedly you can have jobs on multiple threads, but it doesn't seem to matter for export: even if I set a high concurrency, we only do one export at a time. Does that make sense to everyone, or would you expect to have more than one export happening simultaneously if I set a higher concurrency for Sidekiq?
D: Okay, well, cool. I think that's about it for the demo. I don't know exactly what to take away from this; I think we need to test the large project export a bit more, and I'm going to wait until we do that on staging. I'd also like to do some more apples-to-apples comparison between the VMs and Kubernetes; we could enable Kubernetes for A/B tests of project exports.
A: There are quite a lot of challenges around this. What I'm considering is that there are a lot of challenges around running this in production the way it is right now; can we have one place where we have those challenges written down? We talked about the long boot time of new pods, we mentioned 15-gig disks...
A
We
mentioned
concurrency
and
couple
of
other
things
so
that
we
can
evaluate
what
kind
of
risk
we
are
willing
to
take,
because
if
we
go
down
the
route
of
resolving
every
single
problem
that
we've
just
seen,
knowing
that
this
already
exists
in
production,
but
by
pure
luck
of
us,
not
caring
about
limits,
it
works.
I,
don't
think
we'll
ever
finish.
This
yeah.
D: So I'm using the epic for that. I had to make a separate one for the chart, since we can't do cross-group epic linking or cross-group issues, I think, so I'm just using this. Skarbek, I think you're aware of this as well, so you should just add them here if they're not there; otherwise we just associate them to the epic. But I'm using this as the source-of-truth epic.
A: We don't want to be held up, yeah. Cool, awesome demo, thanks for doing that; it exposed some really great things here. So my next question for you would be: is there a way we can... can you add this script somewhere in CI and fire it off, even if it means just running it in a pipeline?
A: So what I'm trying to figure out here is how we can actually help the monitoring team with their dogfooding epic, because I believe this is their top priority for the quarter. How can we add this into our work right now, so that it doesn't block us fully but also provides continuous feedback while we are working on this, which we could fold into our process? I'm not really sure what kind of things we would want to depend on when it comes to our monitoring stack, especially with the Sidekiq queues.
D: I put this in the agenda here, but I talked to Dov on Wednesday, and I think, as Mary just said, they want to know what to prioritize, because they've been given the direction that our monitoring needs to be more like Grafana, but they can't just copy everything. So they want to know what specific things we want.
D
We
talked
a
bit
about
variable
templating,
which
I
think
would
make
it
so
that
we
could
have
a
little
bit
closer
functionality
and
some
of
the
you
know
the
kubernetes
monitoring
dashboards,
and
we
talked
a
little
bit
about
that
before
monitoring,
but
I
think
like
reaching
out
to
the
p.m.
he's
very
eager
to
talk
to
us
and
to
kind
of
get
us
to
open
up
issues
and
then
bring
it
to
his
attention.
So
he
knows
what
we
need
now.
A: On what you said just now, Jeff, about opening up issues: I'm linking the epic that I promoted some time ago, where we already work with them. Just opening up issues we did already, and this is why I'm kind of pulling this back to us. So if some of what you're mentioning right now is still not there, we should be exposing that either in that epic or in the other one that I linked, dogfooding a single metric.
A
So
far
back
if
there
are
more
items
like
this,
please
those
and
the
point
here
is
not
blocking
as
the
point
here
isn't
blocking
another
team.
While
we
are
already
working
on
this
right,
we
are
already
building
out
all
these
dashboards
right
now,
so
might
as
well
see
how
what
we
can
do
to
help
out
putting
this
in
in
a
product.
So.
A: No, but you're talking, Andrew, about the end result of a fully completed product. What we're talking about here is that we have very basic things here and they don't even know where to start, right? That's the big problem: if you tell them to just copy this, they don't know where to start, so we kind of need to work with them and guide them. We don't have the experience of using this properly, I would think.
A: Just to make it clear, it's not up to us, when we have our own major epic, to try someone else's smaller epic, however big it is; it doesn't really matter, it's not up to us. They did come to us to ask for advice, so now it's the time for us to provide help, right? So if this default graph is actually in the way, we should provide that feedback to them and then have them come back to us with the things they adapted, so that we can also adapt.
A: Guys, guys, please don't mix things up. They are not trying to beat Grafana; they're trying to get to a point where this is usable. You're talking about a product that is rounded and completed versus a part of the product that is just starting properly. What we are trying to do here is show them how to get to the point where this is going to be a plausible replacement, not saying that we need all of that built right now.
E: What I would say, if I were to give some positive encouragement, is: take a look at the Explore dashboard in Grafana. Let us mess with that graph, right? Maybe it's just PromQL and we can just type things in and change it on the fly, because the problem at the moment is that it's really inaccessible: it's just a graph, and if we want to dig further, we don't have the controls to do it.
C: On the single source of truth for Chef and secrets: Graham has put together an excellent proof of concept using the Helmfile project. He's got a working POC: he created a small GitLab secrets Helm chart that is able to grab all of our secrets out of GKMS appropriately and create the Secret objects inside of our clusters, and similarly he's got the necessary templating done in the same POC repo, where it grabs our configurations.
C: It's not clear to me either; I haven't fully looked into that epic at all. I was given it very late yesterday, so I didn't actually look into it, but this was the epic where we had a conversation about what we were trying to look for, because I heard through Graham that they were trying to use Helmfile as part of their cluster management.
C
Their
cluster
application
management
feature
they're
trying
to
work
on
will
include
helm
file
to
some
capacity,
but
they've
got
a
couple
of
other
hurdles
that
they
want
to
work
through
before
they
finally
get
to
some
like
this
I.
Imagine
the
part.
That's
gonna
kind
of
hold
us
up
immediately
off
the
bat
is
that
currently
they
don't
have
a
method
to
manage
multiple
clusters
within
one
project,
with
what
they're
trying
to
provide.
C
So
far,
it's
only
limited
to
one
repo
is
equal
to
one
Korea's
cluster,
for
some
reason,
so
I
think
that's
it's
still
a
work
in
progress
from
them,
but
I
feel
like
if
we
went
the
helm
file
route
today.
That
gives
us
a
leg
up
where
we
could
easily
integrate
with
something
that
they
finally
get
rolled
out,
that
we
could
probably
potentially
utilize
in
the
future
and.
A: I'll upload this and ask Daniel to take a look at the recording and provide us some feedback. The thing I would like to avoid is making a decision to go with one approach, the product going completely in another direction, and us tying ourselves again into yet another workaround, which we keep doing. So I think Graham can continue working on the POC; that's not the issue. What we could do is theoretically focus on one item that we could remove, something that would help unblock us, right? We don't have to port all the [unclear] stuff.
D: I mean, I think I put my thoughts in the issue, and I'm just worried that this could be a distraction from the problem we set out to solve, which is having a single source of truth for secrets.
D: [unclear], so I think without solving that problem we can't really have a single source of truth, because what is it going to look like? An engineer is going to update a secret for Chef on their workstation and then go to a pipeline and write it there too? That doesn't seem like it's going to sustain us; we need something better than that.
D: I think it remains to be seen what the workflow will look like, and we can take a look at that afterwards, after we finish, or after we decide what we're going to do for the Kubernetes stuff. With this proposal I see two possibilities: you update a secret in Chef and then you do a deploy, or you wait until the next deploy, see the difference, and apply it; or you have a pipeline running continuously...
D
That
constantly
is
looking
at
the
chef's
secrets
and
updating
our
kubernetes
cluster
and
notifying
us
or
notifying
us
when
there's
a
difference
which
I
think
yeah.
That
was
one
of
the
early
proposals
and
I
I,
don't
know
I
think.
Ideally,
we
would
just
have
a
single
pipeline
that
updates
secrets
in
both
locations
and
then
triggers
the
kubernetes
pipeline
afterwards,
but
I'm
just
really
worried
about
getting
out
of
sync
and
come.
You
know
being
surprised
when
we.
A: So this RCA that I'm supposed to be creating right now could expose exactly what you're saying, Jeff, and could actually support a bigger effort going forward. The problem I have with that is that it puts another huge delay on the team of two that are currently doing this migration, right? And if I cannot source more help from the rest of the department, it's going to put a big dent in what we can do.
C: I think the use of Helmfile, using the POC coming from Graham, will only solve half of this problem. It at least gets us to the point where Kubernetes is using the same configurations that our Chef configurations are using; how we make those changes is still up for debate, and I think that's a completely different conversation that we need to have.
A: All right, how about this: let's do a couple of things in parallel. Skarbek, sync up with Graham and explain the one thing that we are trying to resolve with Helmfile, and have that as the focus, not of the POC, but of how we could be doing this with the current infrastructure that we have. So: resolve one problem, don't replace anything else. In the meantime, I'll also sync with Daniel and ask for feedback.