From YouTube: JWT auth API availability and database load balancing
Description
Related to https://gitlab.com/groups/gitlab-org/-/epics/5215.
A
Okay, so some of you were on the previous meeting related to the container registry and high availability, others not, including myself. But this one is to talk specifically about the Rails API and not really the registry. Given that the registry depends on the Rails API for authentication, though, that's also related to high availability.

Okay, so one of the things that we discussed when the concerns around the registry availability arose was the availability of the authentication API. So we created an epic to investigate how we could improve the availability of that, more specifically the JWT auth route on the Rails API. This epic has a long trail of discussions with Stan, Kamil and others as well, and we came to the conclusion that, contrary to what we were expecting, the API route doesn't rely on the database replicas.

Instead, it requires a connection to the primary database, even though the only required queries are SELECTs. This has an impact on availability: whenever there is a database incident due to a failover, the authentication API won't be able to serve any requests for as long as the primary is down.
B
So, you know, I can see there's been a lot of discussion on here about having this particular route inside the monolith have higher availability, but that's not the service level that the infrastructure team has. We have it for the monolith, and we can't sort of provide more availability for some endpoints over others; if we want to do that, it has to be outside the monolith.

I just want to raise that because it feels like it needs to be said early on, and obviously, if we're going to build this up and make it stronger, there are ways that we can do that. But the reason availability goes away is not because that endpoint fails; it's because the deployment breaks or, like we had last week, we had segfaults that take Rails down. It's not something where we can protect one single endpoint and give it higher availability.

If we want to do one endpoint, we have to do the whole application, and, frankly, if we start taking the whole application beyond 99.95, you basically have to go away from daily deploys, and that has a very big impact on our ability to iterate. So that's just from an infrastructure point of view: we can't do magic things for specific endpoints inside the monolith.
A
Yeah, but that's another point. One suggestion from Kamil was that we could discuss using a separate fleet just for these routes, so they would be served from a separate fleet and wouldn't be affected by other long-running requests causing slowdowns on the server.

So that would be an option, and the other option, or a complement, would be making these routes use only database replicas to get the data, because those are more available than the primary, which is a single server. That's basically what we have raised for discussion: one, see if we can make these routes use only the database replicas and not the primary server; and two, consider using a separate fleet to serve this route specifically. Those are the two main points.
B
Yeah, I mean, it's not just the separate fleet, right, because there's also: how do we deploy that? Do we deploy it at the same time as we deploy everything else? So, if we are setting an expectation with the customer that we're going to have, like, five nines or six nines on this, there's a bigger discussion around that, because that's very difficult technically for us to actually do.

I just want to set the expectation that it's a difficult thing, but I'm all for making it more resilient. That's just the other side of it. Yeah.
A
And, bit by bit, if we could get the API to be more resilient and more stable and always match the SLA, which is 99.95, that would be a good first step. But at least from the way I see it, as long as we depend on a single database server to serve these requests, we are always prone to database failovers, which can take a long time.
A
Yeah, so first, on the load balancer code that we have, from what I saw, we only send a few methods to a read-only connection, and for the remaining ones we always default to a read-write connection. And from invoking the routes with the primary down, we could see that there are a few methods that are required, and right now they are all going to the read-write connection by default. So one of the things is trying to identify whether any of these methods can be implemented on the load balancer to use a read-only connection, if that is possible or not. And there are other things as well: I found, for example, that some specs require the API to use a primary server, and one of them I linked here.
A
This relates to an issue from two years ago, where apparently there was some replication lag on CI builds. Basically, when there was a build, the token used for that build sometimes was not replicated in time to the replicas, which caused the pipelines to fail. I also found an MR from Kamil because of this, which made this authentication route, when using the GitLab CI token, stick to the primary database server instead of going to a replica when that would be possible.

But at the same time, I also found that this relates to a previous issue due to replication lag, and apparently that lag was resolved before those MRs that changed the routes: it was related to SSL compression, so apparently disabling that fixed the replication lag, even though this MR and another one were merged afterwards, possibly to prevent this from happening again.
C
So one thing to consider with the replication lag: is it really safe to perform, let's say, GitLab CI token authentication on the replica? There is a case where this token may already have had its permissions revoked at that point, and with replication lag you are not sure whether it has or hasn't. Probably this is not the issue, but this is something to consider as well.
B
So, Kamil, is there a way that we can sort of say that if you remove a token, then within 500 milliseconds, or a second, or two seconds, that client will no longer get access, and not say it's an instantaneous thing? And then, coupled with that, is there a way that we can make database calls and, on the database call, almost have some sort of declarative way of saying: only allow this if the replication lag to this replica is less than that defined, well-known lag? Because then you can sort of safely say, you know, we'll use the replicas.

The other thing, from a database point of view, is that we have a lot of load on the database primary, and it's a limited resource; we're actually running out of resource on that quite quickly, and anything that we can do to move queries off it is really, really helpful. So there's another aspect to this as well.
C
Yes, so I fully agree. I think we can find a solution to make this work, because it kind of applies to everything. I know that we have tracking of the replication lag; I'm not sure if we have the ability to see that in, let's say, seconds, because this is the binary pointer offset in the log.
B
I don't know what the default there is. There is a default, like you say: there's a certain threshold, and when you get over that threshold the secondary will not be used. I don't know what that number actually is. If that's one second, then that would probably be sufficient to be able to say: well, if you delete a token, clients can still use it for one second, and that's just the way that GitLab operates.
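
To make the threshold idea concrete, here is a minimal Python sketch. All names (Replica, pick_connection, MAX_LAG_SECONDS) are hypothetical, and the one-second figure is the value speculated about above, not a confirmed GitLab default:

```python
import random

# Illustrative only: serve from a replica only if its measured
# replication lag is under a threshold, else fall back to the primary.

MAX_LAG_SECONDS = 1.0  # speculated threshold from the discussion above


class Replica:
    def __init__(self, name, lag_seconds):
        self.name = name
        self.lag_seconds = lag_seconds  # measured replication lag


def pick_connection(replicas, primary, max_lag=MAX_LAG_SECONDS):
    """Return a sufficiently fresh replica, or fall back to the primary."""
    fresh = [r for r in replicas if r.lag_seconds < max_lag]
    if fresh:
        return random.choice(fresh)
    # All replicas are too far behind: use the primary rather than risk
    # authenticating a token against stale data.
    return primary
```

Under this rule, deleting a token has a bounded staleness window: clients can keep using it for at most `max_lag` seconds, which is the trade-off described above.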
A
We could also retry the queries on the secondary servers if they fail. If a token is not found, for example, we retry on the secondary again, at least once, and then, as a last-resort measure, we could go to the primary before returning an error response to the client. That could be another option as well.
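
A rough sketch of that retry-then-fall-back flow, with a hypothetical `lookup` interface on the connections (not an actual GitLab API):

```python
class TokenNotFound(Exception):
    """Raised when a lookup finds no matching token on a connection."""


def authenticate_token(token_id, replica, primary, replica_attempts=2):
    """Try the replica first (with a retry), then the primary as a last resort.

    `replica` and `primary` are assumed to expose lookup(token_id),
    raising TokenNotFound on a miss; both are illustrative stand-ins.
    """
    for _ in range(replica_attempts):
        try:
            return replica.lookup(token_id)
        except TokenNotFound:
            continue  # the row may simply not have replicated yet
    # Consult the primary before returning an error response, so a freshly
    # created token is not rejected just because of replication lag.
    return primary.lookup(token_id)  # still raises TokenNotFound if absent
```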
C
I'm kind of thinking, João, if I am pronouncing your name correctly. Yeah, it's close, it's close. So what is the proper pronunciation?
C
Okay, so there is one very important aspect that you kind of noticed: we have methods that make us use the primary, and I'm kind of thinking that this went unnoticed for a very long time. Basically, even if we think about our logging, we have no way in our logs to know exactly how many requests use the primary versus the replicas.

So I think my very first suggestion, based on your findings about the methods, would be to actually introduce a way to log that into Kibana, to know what percentage of requests fall back to using the primary, maybe because of something new introduced in the code base that we did not anticipate.
B
Sean? No, I'm not, but Sean did the original work of adding the DB counts and DB durations, and the precedent for Redis is that we actually have, like, db...
A
So this is the load balancer code, and we can see here that only the methods listed here are served with a read-only connection; anything that is not here will fall back to a read-write connection.

So this is basically the reason why this route requires a primary and a read-write connection: all these methods are called during requests, and none of them are listed as safe for read-only connections. So this is another thing that needs to be investigated: whether or not those methods can be made read-only.
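
A minimal sketch of the dispatch rule being described: a fixed allowlist of method names is served read-only and everything else defaults to read-write. The method names and ConnectionProxy class below are illustrative stand-ins, not GitLab's actual load-balancing code:

```python
# Methods assumed safe to serve from a replica; anything unlisted is
# routed to the primary, which is why SELECT-only helpers that are
# missing from the list still force a read-write connection.
READ_ONLY_METHODS = {"select", "select_all", "quote", "quote_column_name"}


class ConnectionProxy:
    def __init__(self, read_only_conn, read_write_conn):
        self.read_only_conn = read_only_conn
        self.read_write_conn = read_write_conn

    def connection_for(self, method_name):
        if method_name in READ_ONLY_METHODS:
            return self.read_only_conn
        return self.read_write_conn  # conservative default: assume a write
```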
A
We have not investigated each one of them; this is one of the action items for this issue. There are some which are just SELECTs. I saw that, for example, migration_context is just a SELECT about the state of migrations, and quote, I believe, is just a SELECT as well; this one I saw the code for and I believe it's the same. I'm not sure about prepared statements.
B
I would suggest that you want to get the instrumentation in first, you know, splitting out the DB read counts and DB write counts into primary and secondary roles. And the reason is because then, when you make this change, presuming that's not a hard change to make, which I don't think it will be, you'll be able to see what impact it's had.
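
A small sketch of what that per-role instrumentation could look like; the field names (db_primary_count, db_replica_count) are hypothetical, not GitLab's actual log schema:

```python
from collections import Counter


class RequestDbStats:
    """Per-request query counters, split by database role."""

    def __init__(self):
        self.counts = Counter()

    def record(self, role):
        # role is "primary" or "replica", recorded once per query issued
        self.counts[role] += 1

    def log_fields(self):
        # Emitted with the request log so e.g. Kibana can show what
        # percentage of requests touched the primary at all.
        return {
            "db_primary_count": self.counts["primary"],
            "db_replica_count": self.counts["replica"],
        }
```

Having these counts in place before changing the routing gives a before/after baseline, which is the point being made above.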
D
I have one question for João and Hayley, because I know that there are a lot of things in flight that the package team is supposed to handle. We do have this PostgreSQL metadata database for the container registry.

There is this idea of building the on-premises caching proxy that we are still exploring, and now we are thinking about optimizing the JWT endpoint for the registry. Obviously all of these are very important problems to solve, but I'm kind of worried about the capacity of the package team, because we seem to take a step at many different problems at once, and I feel like with the current capacity it might actually be difficult to bring anything to completion. So what is your perspective on that, João?
A
It doesn't really have anything to do with the registry itself as an application, as a Go application; this is just about the Rails API, so this is stuff that the Rails engineers in Package can help with. It's not something that necessarily myself or Hayley would do. So that's for the authentication API. And then for the registry: well, the metadata database is already in progress and close to completion, and regarding the on-prem cache proxy, we still don't know if that's desirable for customers or not, and even if it is, that doesn't mean we will implement any changes or get it ready in the next couple of months. So right now it's mostly about finding solutions.
B
I will say that there's definitely an argument to be had, sorry, that this work should be done by the database team or the scalability team. Obviously it's a matter of finding when they can slot it in, but, you know, if you're struggling with the amount of things that are up in the air, it is a natural fit for one of those two teams. So it's probably worth engaging with them as well.
A
And in terms of using a separate fleet for a specific route, have we done that for any other route, or do we have a single fleet for every single one of them?
B
The
0.05
percent
on
are
not
because
of
you
know,
making
more
bulkheads
like
breaking
out
a
separate
head,
which
is
a
sort
of
bulkhead
generally,
doesn't
help
those
kind
of
things,
because
the
things
that
go
down
like
you
know,
are
the
database
or
maybe
a
bad
deployments,
or
you
know
things
like
that,
like
the
sig
faults
that
we
saw
last
week
from
from
garbage,
collects
compact
problems
and-
and
those
are
not
you
know
saved
by
by
the
the
segmentation
and
and
when
we
have
more
fleets,
there's
more
things
that
we
need
to
to
juggle.
A
Okay, and do we know which one of those has more availability on average? Is it the same?
B
I mean, they're both, you know, horizontally scalable fleets, and we have the same monitoring on both, so if we see that one of them is running out of workers, we'll increase it. Ideally they should both be healthy. You could go take a look historically at our availability on the SLA dashboard in Grafana and see which one's better; I would guess that it is probably the API.

I think, almost regardless of whether the API is better or not, we should move it there, because it's less surprising: everyone expects it to be there, and if there are availability problems on the API, we should address those separately. I don't think there are, but, you know, the least surprising thing is for it to be on the API.
C
I'm also thinking that on web we perform much heavier calculations, ones that are significantly longer, while these requests on average should be significantly faster to execute. So this is also much better from the latency perspective of JWT auth, if you have more evenly distributed durations instead of the very spiky behavior you see on web. So I fully agree with Andrew that the API fleet seems like basically a better fit, because of the type of compute being executed there.
A
Okay, so I guess, as follow-ups: open an issue to try to add information to the logs on whether a request is using the primary or a read-only replica; also, consider moving this route to the API fleet instead of having it on the web fleet; and then we should probably start talking with the database team, to see whether they have time and are interested in helping with this, and with Scalability as well, to see if we can indeed have these routes served only by replicas.
B
I've got to run, but it's been a great meeting. Thanks, everyone. Yeah.
D
I can hijack a few minutes, because on the previous call we agreed that we need to process logs from our marquee customer. I'm not going to use the exact name of the customer, but basically we need to do some work to process these logs and send them to BigQuery, and there is some engineering work that needs to be done there. To be honest, I'm kind of struggling to find time to do that, and I wonder if anyone actually has a little bit more capacity to work on that with me and collaborate together on this.
A
I think at least we can open an issue under the epic for the API availability, and if anyone has time to tackle that, they can look at it later.
D
Yeah, okay. Can we introduce the epic? I'll add an issue there, because there's no issue for this yet. Yeah.