From YouTube: GMT20200117 k8s migration sidekiq: project export queue
Description
Part of https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/143
B: Unfortunately, we don't have a whole lot to demo. The good news is that we have the Sidekiq configuration done, and we have the Sidekiq secrets configuration done. The bad news is that we're blocked on an issue that prevents us from running project export in the staging Kubernetes cluster. I can give you a quick look at the configuration; there isn't much to see here, just because, you know, we already have this configuration for production.

B: But what is clear is that, you know, there's a lot of config that's duplicated between Chef and the Kubernetes cluster, for example the Gitaly config and all of the shards, and this is going to be even more complicated for production. This is just for staging, where we have, like, a single file node for all of the shards that were used at the time that we did the database import.
B: I don't know, I don't know what the answer is. Another option would be to have a template that sources the data directly from Chef. So we would put the data in, you know, role attributes first, that would get uploaded to Chef, and then we would have a pipeline that generates the Kubernetes configuration from Chef, or as much of it as we need, and then we would inspect it and then apply it.
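A minimal sketch of what that pipeline step could look like, assuming Chef remains the single source of truth. The role name, attribute paths, and ConfigMap name below are hypothetical illustrations, not the actual tooling:

```python
# Sketch: render a Kubernetes ConfigMap from Chef role attributes so
# the config is generated from Chef rather than duplicated by hand.
# Role name, attribute keys, and ConfigMap name are hypothetical.
import json
import subprocess

import yaml  # PyYAML, assumed available in the pipeline image

# Pull the role attributes out of Chef (requires a configured knife).
role = json.loads(
    subprocess.check_output(
        ["knife", "role", "show", "gitlab-staging", "-F", "json"]
    )
)
attrs = role["default_attributes"]["omnibus-gitlab"]["gitlab_rb"]

configmap = {
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {"name": "gitlab-sidekiq-config", "namespace": "gitlab"},
    # Carry over whatever subset of the Chef attributes the chart
    # needs, e.g. the Gitaly shard map mentioned above.
    "data": {"gitaly_shards.json": json.dumps(attrs.get("git_data_dirs", {}))},
}

# Write the manifest out so it can be inspected before `kubectl apply -f`.
with open("sidekiq-configmap.yaml", "w") as f:
    yaml.safe_dump(configmap, f)
```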
B: We upload them to object storage, encrypted, and we have a helper script for that. I mean, we could probably just have yet another pipeline that pulls down the secrets from object storage, decrypts them, and then syncs them with the Kubernetes secrets. You know, this is a solved problem. It's just that, it seems like, we've been talking about going to Vault for as long as I've been working here.
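A rough sketch of that "yet another pipeline" idea, pulling a GKMS-encrypted file from object storage and syncing it into a Kubernetes secret. The bucket, keyring, key, and secret names are all hypothetical placeholders, and the helper script mentioned above is not reproduced here:

```python
# Sketch: pull a GKMS-encrypted secrets file from object storage,
# decrypt it, and sync it into a Kubernetes secret. All names below
# (bucket, keyring, key, secret) are made-up placeholders.
import subprocess

BLOB = "gs://gitlab-staging-secrets/sidekiq.json.enc"  # hypothetical

# Fetch the encrypted blob from GCS.
subprocess.check_call(["gsutil", "cp", BLOB, "sidekiq.json.enc"])

# Decrypt with Cloud KMS, the same mechanism Chef uses.
subprocess.check_call([
    "gcloud", "kms", "decrypt",
    "--location", "global",
    "--keyring", "gitlab-secrets",   # hypothetical keyring
    "--key", "staging",              # hypothetical key
    "--ciphertext-file", "sidekiq.json.enc",
    "--plaintext-file", "sidekiq.json",
])

# Render the secret with a client-side dry run and apply it, so the
# sync is idempotent rather than failing when the secret exists.
manifest = subprocess.check_output([
    "kubectl", "create", "secret", "generic", "gitlab-sidekiq-secrets",
    "--from-file=sidekiq.json", "--dry-run=client", "-o", "yaml",
])
subprocess.run(["kubectl", "apply", "-f", "-"], input=manifest, check=True)
```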
A: Do we have to shift the focus a tiny bit and say that the Kubernetes migration is being delayed because we are missing crucial components, so that I can rally some support? Or do we have a quick fix, yet again, to continue, like, chugging along with our migration while this is happening in the background? That is the only thing I cannot decide at the moment; you're more familiar with the state of affairs.
B: Generally, secrets are fairly stable, and there aren't as many secrets for just Sidekiq; we don't have to worry about all of the certificates and everything like that. But there are certainly a bunch of them: we have the Gitaly encryption, we have, you know, database credentials, Redis credentials. We're already using Kubernetes secrets for the Redis credential for mailroom, for example, and for the registry certificate.
B: The problem is that we want to maintain a single source of truth for secrets between Chef and the cluster. That's the crux of the issue; not, I think, secrets management in general, it's more the single source of truth. So whatever system we change, we need to change it on both sides.
B: So Chef, currently, what it does is: it has JSON files that are stored encrypted with GKMS. It's a very simple design, but it works okay. We could obviously just source the secrets from those JSON files in a pipeline and decrypt them with GKMS, which is just what Chef does, and then we would have a single source of truth. This whole setup, though, was always supposed to be a temporary thing until we have Vault.
C: Yeah, but we still have other things that are blocking us from reaching production today. Yeah, so I think we could probably continue: we could work on the secrets and configuration in parallel with us trying to get auto-deploy working, because I don't want to go to production if we don't have auto-deploy working.
C: On that same token, I've got a very controversial opinion on this overall. I don't even like this opinion, but I think we should allow ourselves to go to production without a single source of truth, just to enforce the fact that when we do accidentally create an outage because we didn't update something, it pushes the need for that corrective action higher on the priority list.
B: I may need to tweak it, but I think what we have now is good enough for staging. And then I think we did come across the issue that we have different database endpoints for Sidekiq and the front end: Sidekiq connects to the Sidekiq PgBouncer and the front end connects to the main PgBouncer. In the charts there's only the option for one database endpoint for Postgres, so we're going to need to configure that. I think it's going to be simple in the charts.
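For what it's worth, the chart-side fix being described might look something like a per-deployment override of the global database endpoint. The value keys below are an illustrative assumption about how such an option could be shaped, not current chart settings:

```python
# Sketch: Sidekiq and the webservice need different PgBouncer
# endpoints. The per-deployment `psql` override shown here is a
# hypothetical shape for that option, not an existing chart value.
import yaml  # PyYAML

values = {
    "global": {
        # What the chart supports today: one endpoint for everything.
        "psql": {"host": "pgbouncer.stg.example.com", "port": 6432},
    },
    "gitlab": {
        "sidekiq": {
            # Hypothetical override pointing Sidekiq at its own PgBouncer.
            "psql": {"host": "pgbouncer-sidekiq.stg.example.com", "port": 6432},
        },
    },
}

with open("database-endpoints.yaml", "w") as f:
    yaml.safe_dump(values, f)
# Then: helm upgrade gitlab gitlab/gitlab -f database-endpoints.yaml
```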
A: ...the number that's going to get rolled out across clusters. We can't do that easily with Sidekiq, which is tightly tied, obviously, to the Rails project. Not even tightly: it's part of, it is the Rails project, basically. So in order for us to be able to, yeah, basically do what jarv is going to explain now, we need, at the same time as we are building the omnibus package out of those specific SHAs, to also build the images that are going to be reused in the charts going forward.
B: Sure. So what we had to do is: I created another deploy branch, I updated the component versions, then I tagged it. I had to make some small changes, because I only wanted the CE jobs to run, sorry, only the EE jobs to run, and that's all I care about. So I had to change the CI configuration to say, like, for an auto-deploy all we care about is EE, and that was pretty much it. We had one problem: we ran into one bug where assets weren't compiling.
B: What I did for now is, so for auto-deploy we have this thing in the CI config where we have a check: if it's an auto-deploy tag, we compile assets. We did this for omnibus, and that was carried forward over to CNG. The reason we did this was so that we didn't wait for assets to be uploaded to the registry before doing the auto-deploy. I'm trying to remember if there were more reasons besides that, but...
B: So anyway, because of the problem with compiling assets, I just used the assets image instead. So on this branch that I was testing on, I commented out the logic that says: if this is an auto-deploy, compile assets. Instead it just pulls down the image from the registry, which is fine, because all the deploy is, is a green commit right now, so the image exists. So that's fine. And then at the end of it, if I click on here... oops, not here, I want the Sidekiq one here.
B: The terrible thing here is that we have yet another auto-deploy format for the tag. And this won't make sense for Andrew and Bob, for Skarbek and Mary, but you can see that, like, you know, now we have an additional format where it's 12-7 instead of 12.7, and yeah. So this is, like, the third generation.
B
It's
because
it
needs
to
be
a
host
name
valid
for
a
host
name.
I
guess
like
if
you
can't
not
dots
in
it,
I
don't
know
exactly
where
this
logic
is
I
need
to
look
into
it,
but
maybe
is
valid.
As
a
registry
tag,
you
can
have
dots
in
a
registry
tag
but
I
think
somewhere
else.
There
is
some
like
changing
dots
and
the
dashes,
but
this
is
what
you
end
up
with.
This
is
the
so
this.
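For those not steeped in the auto-deploy tooling, the dot-to-dash mangling can be summarized in a few lines. This is a sketch of the apparent rule, assuming the constraint really is the Kubernetes DNS-1123 label format; the actual logic lives somewhere in the delivery tooling and isn't reproduced here:

```python
# Sketch: why 12.7 becomes 12-7. Docker registry tags may contain
# dots, but DNS-1123 labels (used for Kubernetes object/host names)
# may not, so somewhere dots get swapped for dashes. Illustrative only.
import re

def to_dns_label(tag: str) -> str:
    """Make an auto-deploy tag usable as a DNS-1123 label."""
    label = tag.lower().replace(".", "-")
    # DNS-1123 labels: lowercase alphanumerics and dashes, must start
    # and end with an alphanumeric, at most 63 characters.
    label = re.sub(r"[^a-z0-9-]", "-", label).strip("-")[:63]
    return label

print(to_dns_label("12.7.202001170920"))  # -> 12-7-202001170920
```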
A: Is there special significance for dashes as well? You know, okay, this is a side problem that I don't want to deal with, honestly; if it's dashes, then we'll survive with dashes in Docker. Yeah, okay, well, okay, cool, thanks for the updates! Sorry that it's this painful, but this is the world we live in. I just want to move on, because we have Scalability here as well.
A: Skarbek, would you mind, and I'm going to put you on the spot here, maybe just explaining, in short, the challenges we've seen by selecting this project export queue while we were developing, while we were trying to put it in staging, rather? Maybe just kind of run us through, from the application side, what you saw as issues. I'm putting you on the spot here, so only if you can.
C: The instance sizes are slightly different, but the max memory usage per worker puts us in line with an individual node. So right now we've got node sizes that are 51 gigabytes of RAM, but they run four workers, whereas our nodes inside of GKE have, if I recall correctly, 13 or so gigabytes of RAM, something relatively low, but I can't remember off the top of my head. But with the max memory, Kubernetes is setting the max memory for the request, or the limit, excuse me, of the pod to 16 gigabytes.
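A quick back-of-the-envelope check of the figures quoted above (the numbers are as stated in the call; the conclusion in the comments is my reading of why the sizing is a problem):

```python
# Rough arithmetic on the figures mentioned above.
vm_ram_gb = 51        # current Sidekiq VM size
workers_per_vm = 4    # workers per VM
gke_node_ram_gb = 13  # approximate GKE node size mentioned
pod_limit_gb = 16     # memory limit being set on the pod

per_worker_gb = vm_ram_gb / workers_per_vm
print(f"per-worker share on a VM: ~{per_worker_gb:.2f} GB")  # ~12.75 GB

# A 16 GB pod limit can never be satisfied on a ~13 GB node, which is
# presumably why the pod sizing is out of line with the node sizing.
print(pod_limit_gb > gke_node_ram_gb)  # True
```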
B: So we do have a load testing issue open; I opened up a load testing issue this week. I think the most important thing for me is going to be: we need to create some larger projects, do some exports, and try to see where things fall over on IO, because my biggest concern is, like, we have a lot of exports in the queue and we're getting hammered on IO for the small number of nodes we have.
D: The other thing, like, the one thing about background processing jobs, is that they're really easy to scale based on queue length, right? Like, that's the beauty of background processing: you know how much work you need to get done, because it's in a queue. And so that's generally, like, a much better metric than CPU or memory, because if you have lots of jobs that are waiting to get run, you scale up more workers.
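As a sketch of what scaling on queue length rather than CPU or memory could look like: query Prometheus for the Sidekiq backlog and derive a replica count from it. The metric name, endpoint, and sizing constant here are assumptions for illustration; the real series names would come from GitLab's Sidekiq exporter:

```python
# Sketch: derive a desired worker count from Sidekiq queue length, the
# way an autoscaler driven by a PromQL query might. Metric name and
# Prometheus URL are illustrative assumptions.
import json
import math
import urllib.parse
import urllib.request

PROM = "http://prometheus.example.com"                    # hypothetical
QUERY = 'sum(sidekiq_queue_size{name="project_export"})'  # hypothetical
JOBS_PER_WORKER = 10  # backlog one worker should absorb (assumption)

url = f"{PROM}/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
result = json.load(urllib.request.urlopen(url))
backlog = float(result["data"]["result"][0]["value"][1])

# Scale replicas with the backlog, clamped to sane bounds.
desired = max(1, min(50, math.ceil(backlog / JOBS_PER_WORKER)))
print(f"backlog={backlog:.0f} -> desired replicas={desired}")
```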
D: Yeah, we should be the ones, because effectively you need a PromQL query for that, right? And that would be something, you know, we've got experience with Sidekiq, we've got experience with PromQL. Well, we've got, obviously, everything we need; we've got a common model we can use, so it makes more sense for us to do that than anyone.
D: Yeah, actually, we should talk about it, because effectively what we are saying is that the language that we were talking about in the last meeting, you know, the selector language, will need to be applied to kind of create the Kubernetes component, rather than to configure Sidekiq inside a pod, considering the pods. I don't know if I've got the right terms and stuff, but are you following what I'm saying? Is it clear, folks? No? Okay, so it's confusing. Okay!
D: So what I'm saying is: previously, in my mind, Sidekiq would kind of boot up with sidekiq-cluster. You boot up inside a pod, it would read the language that would say "I'm going to take latency-sensitive, CPU-bound jobs and run them in my Sidekiq pod", and run them. But actually, now what we're saying is that when we configure Kubernetes, we almost give that config to Kubernetes, and then Kubernetes will basically use that...
D: ...to kind of generate a config that starts Sidekiq with the queues expressed by that expression. And so we kind of need to expand it beforehand, you know, because Sidekiq doesn't know how to expand those queues, to insert an expression and get the queues. So either we need to put that into Sidekiq, or we need to get sidekiq-cluster inside the pod, or we need to make Kubernetes, you know, do this.
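A sketch of the "expand it beforehand" option: evaluate the selector expression against the queue attributes at deploy time and hand the resulting flat list of queues to the pod. The attribute table and selector shape below are simplified stand-ins for the real ones:

```python
# Sketch: expand a queue selector into an explicit queue list at deploy
# time, since plain Sidekiq only takes concrete queue names. The
# attribute table and selector format are simplified stand-ins.
QUEUE_ATTRIBUTES = {
    "project_export": {"urgency": "low",  "resource": "memory"},
    "post_receive":   {"urgency": "high", "resource": "cpu"},
    "new_note":       {"urgency": "high", "resource": "cpu"},
}

def expand(selector: dict) -> list:
    """Return the queues whose attributes match every key in `selector`."""
    return sorted(
        name
        for name, attrs in QUEUE_ATTRIBUTES.items()
        if all(attrs.get(k) == v for k, v in selector.items())
    )

# "latency-sensitive, CPU-bound" becomes a concrete list that the chart
# can template straight into the Sidekiq pod's command line.
print(expand({"urgency": "high", "resource": "cpu"}))
# -> ['new_note', 'post_receive']
```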
A: So we have, the Memory team is driving this: they just got the approval the other day to move sidekiq-cluster to Core, so they're moving that, probably very soon. From our side, from Scalability, we have the changes on how we are actually creating the queues, and then from the Delivery side we should be changing the way that the charts are consuming them.
D: I mean, the other option is that if you start sidekiq-cluster with, and this is kind of going into the weeds a bit, but, like, I have this kind of thing about having a process manager inside a Kubernetes pod that's running, like, a single process underneath it, but maybe we could get past that. It's just... no, that's fine, I mean, you just keep running with one, it's fine, it's a little bit of extra overhead, and then you run sidekiq-cluster everywhere, including inside the pod.
A: You did, you did, you already exposed a couple of things. The other thing I wanted to just highlight is that Skarbek did some work on how Sidekiq jobs behave inside of Kubernetes when the pods get killed, so that might be very interesting to share in this call as well, if others are interested in hearing it.
C: I feel like it may be repeat information for those that are familiar with Sidekiq, but essentially I did two tests where Sidekiq was killed while it was processing a job. In the first, it was killed in an abrupt manner, you know, sending it a SIGKILL while it was processing a job, and we used the Sidekiq reliable fetcher to determine that the job got orphaned in some way, shape, or form. In this particular case it took roughly 40 minutes, but eventually a new pod picked up the work that was abandoned in the queue. The other method was testing what happens when Sidekiq is processing a job and is killed in a clean manner, so sending a SIGTERM, and in this particular case Sidekiq did exactly what we wanted it to do: it detected that it was processing a job, it stopped the processing, threw the work back into the queue, and the next pod that was available picked it up pretty much immediately.

C: So the termination will send a SIGTERM, and if, you know, nothing happens within that time period, I think it's a 30-second window, then Kubernetes will automatically send a SIGKILL to it. So even if we can't clean up for whatever reason, like the pod is just too overloaded for X reason, we still have the ability, using the Sidekiq reliable fetcher, to pluck out jobs that were abandoned. That would be detrimental for very important queues where jobs need to happen ASAP, but I think that's an emergency-scenario kind of situation, yeah.
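A minimal illustration of the clean-shutdown path described above, written in Python rather than the actual Ruby worker code: catch SIGTERM, stop taking new work, and leave the backlog for the next pod before the grace period (30 seconds by default in Kubernetes) expires.

```python
# Sketch: the shape of a clean shutdown under Kubernetes. On pod
# deletion the container gets SIGTERM; terminationGracePeriodSeconds
# (default 30s) later it gets SIGKILL. The queue is a stand-in list.
import signal
import sys
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # Kubernetes sends SIGTERM first; flag the loop to wind down.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

queue = [f"job-{i}" for i in range(100)]  # stand-in for a Redis list

while queue:
    if shutting_down:
        # Like Sidekiq's clean shutdown: stop pulling work and push any
        # in-flight job back, so the next pod picks it up immediately.
        print(f"terminating; {len(queue)} jobs left for other pods")
        sys.exit(0)
    job = queue.pop(0)
    time.sleep(0.1)  # pretend to process the job

# On SIGKILL none of this runs; recovering the in-flight job then falls
# to the reliable fetcher's orphan detection (the ~40-minute path).
```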
E: I think we could also make different decisions based on the type of queue we're working on, like, some of them, yeah, the ASAP stuff: they are jobs that should not take minutes to run, they should take seconds. So the question I had was: you mentioned that if we're relying on reliable fetch, where does the 40-minute delay come from before the job gets picked up again? I don't...
C: I'd like to continue trying to work on enabling auto-deploy. I've been working on that for the last month at this point, so I would like to try to continue that and try to get it across the finish line. So hopefully, if that all goes well, next week we could auto-deploy Sidekiq into, say, staging, or something to that effect. Yes, that's what I want.
C: The load testing issue was something that he recently created; I think that'll be a wise choice to pull in. I don't know if we would have time to complete that, and doing a database configuration audit would be wise. So if anything, I'm going to move the priority, or shift the priority, so that a database configuration audit for Sidekiq in staging, just to make sure all things are kosher like we need them to be, and then a load test, I think, are the two other good items that we can perform.
A: Cool, I put that as jarv's next tasks. It's easy when people are not in the call: you can just assign all kinds of things to them. So he has to complete the project export queue work, adding it to staging, so connecting the staging Sidekiq to the Kubernetes cluster, and then start the investigation on the GKMS source of truth for secrets between Chef and Kubernetes.
C: Just to supplement one of the items I mentioned, this database audit: we've got configuration items inside of our Postgres configuration for omnibus that we don't have the ability to configure inside of Helm. So this issue is to figure out what the defaults for these options are, and to maybe supplement our Helm charts with these configuration options, to make those changes if we deem them necessary.