From YouTube: GitLab Container Registry - High Availability discussion
Description
Discussion with Distinguished Engineers about making GitLab Container Registry highly available.
A: So today I hope that we'll have the opportunity to talk about the container registry and high availability. That's something that one of our customers actually requested. Unfortunately, we are waiting for Zhao. I would really like him to join, but there is an ongoing production incident related to the container registry right now, and he told me that he might be a little bit late. So there is an agenda document, and I will try to take some notes.
A: Yeah, so I just wanted to start by saying that this is a very interesting problem, and it's kind of a difficult problem to solve for the customer without knowing exactly where the problem is. I know that the team and other people, like John Hampton, perhaps actually had the opportunity to talk directly to the customer.
B: If I can summarize it again: a few weeks ago now, maybe a month and a half, we had an outage that affected the container registry. It ended up being about seven hours in total, and it just so happened that for one of our major VIP wireless customers, the iPhone 12 was released that day, and they couldn't pull anything. They couldn't pull an image into their Kubernetes pods, which prevented them from scaling up, and they ended up losing a lot of sales. So they were very upset. But in general they have this idea that, as gitlab.com customers, they want to rely on GitLab.
B: They want to rely on GitLab for the management of the platform, but they would like some additional redundancy when it comes to their container registry. It kicked off a bunch of projects. We've been considering things like: how do we make the services around the registry, like the auth API, more reliable? Another issue that they brought up was potentially having a push-pull mirror, locally or in multi-cloud.
B
So
not
you
know
as
an
option
for
for
them
to
provide
some
additional
done,
redundancy
and
so
where
that's
all
left
off
is
we've
kind
of
told
them.
We
have
some
immediate
plans
that
are
actionable
and
we've
been
making
changes
to
the
runner,
we've
been
making
improvements
to
the
dependency
proxy
and
that
we
will
present
to
them
these
blueprints
and
a
plan
for
this
future
state,
which
could
include
this
push-pull
mirror
or
could
include
higher
availability
in
other
areas
of
the
services.
B: So with all of that in mind, the engineers at GitLab have been putting together this blueprint. Joan and Gorish and Hayley have put a lot of work and thought into this, and so now this discussion is for all of you to come together and say: okay, what's actually feasible and which direction do we want to go?
A: Yeah, thank you very much for summarizing that. Coming into this, have you had the opportunity to actually read the blueprint?
C: I haven't had a huge opportunity, so I'll be honest, I'm sorry about that. It's the second day back.
D: For me, looking over the blueprint, I think the general idea is that we somehow introduce a component that can be intelligent and cache, or provide mirroring functionality for the container registry. It would, I presume, be run by the customer within their control, and there would be some kind of notification system, pull or push, to have some kind of consistency. I think that this is the high-level idea behind the blueprint. Am I correct, team and Hayley?
A: Yeah, I think that is correct. I see that Hayley joined, so.
E: Yeah, so I think the big thing with that is: it's not push-pull, it's pull only. Introducing pushes really opens up a can of worms, since tags are mutable, and then we'd have two sources of truth for tags if we enable pushes on this local proxy.
A: Yeah, so I had the opportunity to read the blueprint and, to be honest, I feel like I'm not completely sold on this idea, although it might be a very interesting proposal. The concerns that I have, and I might be totally wrong, which is the reason why Camille and Andrew and other people are here, are that it's quite a complex solution and, to be honest, I'm not completely sure it is actually going to solve the problem for the customer.
A: It's a complex solution because, first of all, we don't really know what the infrastructure of the customer looks like. We know that it's Kubernetes, but you can model many different things with Kubernetes, and it's always a little bit of a challenge to introduce a highly available service in a distributed environment. We do not really know if the customer is going to do that successfully, and there is always this question of how to actually model a highly available, presumably eventually consistent, data store in Kubernetes.
A
From
what
I
understood
from
the
blueprint,
we
do
not
really
want
to
use
object,
storage
because
object.
Storage
is
not
reliable
enough
right,
so
the
idea
is
to
actually
build
some
kind
of
a
proxy
that
has
this
peer-to-peer
data
exchange
mechanism
built
in
and
I
feel
like.
We
are
stepping
on
the
this.
You
know
difficult
territory
of
tackling
consensus,
algorithm
and
stuff.
C: The first concern I have, which I've already raised with you, is that as soon as you start saying five nines, that is not something that modern infrastructure is built around. So, you know, I pointed you to our ebook. If you look at Google Cloud, Amazon, GCP, nobody would say that they're going to give you five nines of availability. It's kind of like interest in a bank, or rather an investment.
C: If you had a 99.999% chance of getting your money back, then the chances of you making any money from that are very low, and it's the same here: the risk that you can take if you're trying to keep an availability of five nines is almost nothing, and that holds back a lot of stuff on gitlab.com.
C: So there's a whole lot that I can talk about there. But I think, as a first point, we really want to say that five nines is not a realistic target that we want to go with. At the same time, maybe what we can do is build something that's very decoupled from the main application, and through that, and redundancy on top of that, we can kind of aim towards it, rather than building it into the application.
C: You could have three copies of this, and if one of them went down, say the AWS one could still stay up, or the Azure one could still stay up. I think it would be much simpler to get the availability through redundancy rather than through, you know, distributed protocols or anything like that. That is a whole rabbit hole which I don't think we want to go down and, to quote the, you know, values:
C: It's not a boring solution, whereas redundancy is, and there's a lot of highly available software that's just built on top of redundancy. So I think we should try and focus on that, at least without having a lot of knowledge of the solution.

A: Yeah, I agree with you, Andrew.
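To illustrate the availability-through-redundancy argument being made here, a minimal sketch. The key (and optimistic) assumption is that the replicas fail independently; the numbers and replica counts are illustrative only.

```python
# Availability through redundancy: if each of N independent cache replicas
# is up 99.9% of the time, all N are down simultaneously only
# (1 - 0.999)**N of the time.

def combined_availability(per_replica: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent copies is up."""
    return 1 - (1 - per_replica) ** replicas

single = 0.999  # "three nines" for one cache node
print(combined_availability(single, 1))  # one node: still three nines
print(combined_availability(single, 3))  # three nodes: roughly nine nines on paper
```

The "on paper" caveat matters: correlated failures (a shared cloud region, a shared bug) break the independence assumption, which is why the discussion keeps coming back to spreading replicas across providers.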
E: Yeah, so from what we've observed: we also have issues open for the gitlab.com registry service, kind of pursuant to this goal of more uptime and higher availability. There's the issue that the registry depends on the GitLab API to authenticate, and then, besides that, there's GCP.
E
So
both
of
those
have
9.95
or
so
I'm
not.
I
don't
have
the
exact
numbers
in
my
head
right
now,
but
so
I
think
that's
on
that
front
of
just
making
the
servers
more
available.
We
have
that
kind
of
a
two-prong
thing.
It's
about
to
be
pronged
once
the
metadata
database
is
up
as
well
for
the
registry
and
what
we've
observed
is
it's
the
weakest
link
in
recent
recent
outages
has
been
the
api
for
authentication.
That's
been
lower,
has
a
lower
availability
than
the
storage
bucket.
C
You
can
I,
I
have
no
information
on
me
right
now,
but
my
when
I
think
back
to
the
incidents
that
I've
been
involved
in
a
large
number
of
them
have
been
down
to
gcs,
storage
and-
and
you
know,
availability
problems
on
that.
Certainly
like
the
last
three
or
four
registry
issues
that
I've
built.
The
incidents
that
I've
been
involved
in
have
all
been
that.
So
do
you,
do
you
have
numbers
on
that
or
like
what
did
you
use
to
base
that
on.
D: I'm kind of wondering, because I wrote a comment to that effect as well: we consider the API to be something significantly wider than what the registry requires. So I was proposing at some point that, if the auth is a problem, is there some way to disconnect this JWT auth endpoint from the rest, to be a separately, additionally managed component?
D
If,
if
this
is
something
that
like,
we
have
so
much
emphasis
on
sla,
because
right
now
like
it's
in
this,
like
the
amount
of
the
request,
this
endpoint
receives
it's
like
pretty
high
magnitude,
but
it
kind
of
falls
into
the
common
bucket
and
basically,
the
noisy
neighbors
can
really
heart
this
availability
of
the
registry
and
maybe
like
one
of
the.
D
If
this
is
really
like
the
problem
that
we
are
facing
with
the
appi,
maybe
one
of
the
the
first
step
is
like
to
disconnect
this
endpoint
from,
like
general
epi
feed
or
like
the
web,
switch
to
the
separate
fleet
the
same
way,
how
we
have
like
the
git
handling
separately
being
done
to
web
and
happy
if
this
is
so
important
component
compared
to
everything
else.
A
So
I
think,
actually,
it's
possible
to
actually
improve
the
apa
endpoint
for
the
container
registry
authentication,
but
I
would
like
to
still
challenge
this
idea
that
we
need
to
dramatically
improve
the
availability
of
container
registry,
like
that.
What
what
we
do,
in
my
opinion,
depends
on
what
the
problem
of
the
customer
is
exactly
like.
I'm
still
not
able
to
you
know,
have
an
answer
for
such
a
simple
question
like.
Is
that
all
the
images
that
they
have
that
they
need
to
have
highly
available?
A
Or
is
there
one
or
two
images
for
just
the
services
that
are
being
updated
most
frequently
that
they
need
to,
like
you
know,
have
in
a
hot
cache
all
the
time
like.
B: Well, one thing is that they push many, many images; they turn over pretty quickly. But one of the things they were talking about was that even if they'd had some of the images cached, that would have helped them, because they could have used an older image that was in the cache, and that would have been okay.
B
And
when
we
were
talking
about
five
nines.
I
think
the
contact
at
the
customer
was
saying
they
don't
really
want
five
nines,
like
they
understand
that
that's
not
may
not
be
in
our
zone,
but
they
were
really
worried
about
the
time
to
resolution
like
if,
if
we
had
99
availability,
but
we
never
had
an
outage
of
more
than
a
half
hour
or
an
hour
and
and
things
were
held
in
the
cache
that
would
probably
be
okay
with
them.
B
So
yeah,
I
don't
think
that
they're
saying
you
have
to
hit
five
nines,
or
else
we're
going
to
go
elsewhere.
I
think
they're
just
saying
we
need
to
have
options.
We've
got
redundancy.
B: Pulling and pushing definitely affects their builds, and if their builds are broken, that's frustrating, but they could live without that for some time. The problem was, they were trying to scale up the app and pull images, and they couldn't do it. That was really the core of their concern: if something happens and they need to scale up their site, they need to be able to pull reliably.
E: I asked a similar question, and I think one of their concerns was the number of images that they use and the size of those images. Just the storage capacity to have a warm cache on all their kubelets would be too much.
C: Yeah, okay, so that's why the cache hit rate would be a problem, because it's per node. But is there not some Kubernetes cache, or at least some sort of container caching solution, on the cluster, that could be shared across nodes? Because, you know, I can imagine that even if you have a lot of images, you should be able to set aside, you know, a few terabytes for a local cache.
C: Yeah, and obviously we'd have to use a different machine or a different cluster, because it wouldn't be good if we took everything down and then couldn't spin up new machines. But I do think it'd be good if we had a product solution, even if it was some third-party piece of software, that we could encourage customers to run as a local cache, because it obviously takes load off gitlab.com, and, you know, the bigger their cluster gets.
D: I'm kind of thinking, because we have a container proxy in the GitLab dependency proxy feature: could they basically run their own GitLab, configured against a different object storage provider or whatever, and request images from their GitLab, with that GitLab in turn requesting images from our GitLab?
A: Interesting, because I still have questions about the storage. From what I understood, they want to have a highly available solution. So if we build this on-premises reverse caching proxy for the container registry, they would need to run it in kind of a highly available manner.
A: And how do we approach storage in that particular case? Because if they also want to run it inside Kubernetes, it might be a little bit tricky to run it in a highly available way. Outside of Kubernetes might be tricky as well. It's always tricky when we start thinking about where they are going to store their cache.
D: I'm kind of thinking that this is the unsolvable problem with that approach: if you have a highly available caching service that you base your availability on, it's just waiting for a disaster to happen at some point. If the service, for whatever reason, fails, and gitlab.com fails at the same time, you just don't have any data; it doesn't prevent the outage. Caching is good for reducing load, but if things go wrong, they kind of go wrong in a cascading way.
C
I
mean
so
just
to
challenge
that
a
little
bit
like
if
you
had
the
most
boring
solution
right.
You
had
three
caching
nodes
that
were
each
in
a
different
region
and
you
actually
just
went
with
like
block
storage
like
really
really
boring,
old-fashioned
block
storage
behind
as
the
as
the
cache
thing,
which
is
pretty
reliable.
Like
I
can't
remember
the
last
time
we
had
a
big
block,
storage
incidence
and
you
have
three
caches
and
you
know
hits
are
randomly
pushed
to
different
caches.
C
If
one
of
them
falls
over,
you
know
that's
what
kubernetes
is
good
at
they'll
direct,
the
request
to
the
other
two
caches.
We
don't
need
to
build
any
complicated
mechanisms
for
kind
of
retrieving
between
the
caches
or
anything
like
that,
and
if
one
of
them
one
of
the
nodes,
goes
down,
it
just
gets
taken
offline
and
if
it
just
so
happens
that
you
know
that
there's
an
image
request
that
that
isn't
on
one
of
those
nodes,
then
so
be
it
then
you
know,
that's
that's
and
it
and
obviously.
D: I know, but there is one problem with that: you assume that your cache is significantly smaller than the number of images that you need to keep hot. It means that at some point you're going to have cache eviction events, and, for example, you'll have your application running on Kubernetes from an older image, that image has already been evicted from all the cache nodes, and your upstream provider is gone. You cannot pull it; you just don't have it.
C: Perhaps having some more information on those patterns would be helpful here, because my gut feeling, and this could be totally wrong because I have no background on this, is that the things that you need to scale up the most are probably also the hottest things in your cache, and the things that are most critical to scaling.
C
You
know,
like
maybe
your
you
know,
customer
facing
websites-
that's
got
some
new
specials
on
it
and
those
things
are
the
most
likely
things
to
be
in
the
cache
rather
than,
as
you
say,
the
you
know
the
three
week
old
image
that
you
know
it
hasn't
seen
as
much
activity.
I
might
be
wrong
on
that,
but
that's
just
my
gut
feel.
D
That's
that's
my
concern
really
but
like
if
it
faces
it
spice
badly
and
like
it's,
this
mechanism
doesn't
will
not
help
you,
I'm
kind
of
like
thinking
that
like
if
you
want
to
have
like
the
actually
highly
available
service,
you
need
to
have
actual
replica
of
this
data
in
multiple
resources,
and
you
just
try
these
sources
to
fetch
them
and,
like
you,
just
retain
these
data
as
long
as
you
need
them
basically,
and
this
kind
of
gives
you
like
the
guarantee
that,
like
if
app
is
fluke,
you
just
have
another
source
that
it's
not
dependent
on
that
app
and
has
another
type
of
the
storage
it
could
be.
D
I
don't
know
azure
with
their
object
storage.
That
is
basically
different
technology,
but
actually
have
that
highly
available
because,
like
that,
it's
kind
of
to
some
extent
as
soon
as
different
services,
availability.
D: The auth is an interesting problem, but in the current registry the auth service is external to the registry, so it could even be, I don't know, HTTP basic auth in the simplest case, really, to provide a JWT token. So it could be the most boring solution, really, to have additional storage for these images.
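For context on the external-auth point: in the Docker distribution token scheme, the registry answers an unauthenticated pull with a 401 whose `WWW-Authenticate: Bearer` header names a token realm, and the client then fetches a JWT from that realm with its own credentials. A simplified sketch of the client side (the realm URL and service name below are placeholders, and the header parsing ignores escaped quotes and commas inside values):

```python
from urllib.parse import urlencode

def parse_bearer_challenge(header: str) -> dict:
    """Parse `Bearer realm="...",service="...",scope="..."` into its parts
    (simplified: assumes no commas or escaped quotes inside values)."""
    assert header.startswith("Bearer ")
    parts = header[len("Bearer "):].split(",")
    return {k: v.strip('"') for k, v in (p.strip().split("=", 1) for p in parts)}

def token_request_url(challenge: dict) -> str:
    """Build the GET URL the client calls (with its own credentials, e.g.
    HTTP basic auth) to obtain a JWT the registry will accept."""
    params = {k: challenge[k] for k in ("service", "scope") if k in challenge}
    return challenge["realm"] + "?" + urlencode(params)

hdr = ('Bearer realm="https://gitlab.example/jwt/auth",'
       'service="container_registry",scope="repository:group/app:pull"')
url = token_request_url(parse_bearer_challenge(hdr))
```

Because the registry only checks the signed token, the service issuing it can in principle be anything, which is the "boring solution" being suggested.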
D
I'm
kind
of
like
thinking
that,
like
we
have
this
proxy
in
the
container
registry
of
the
github,
it
kind
of
source
some
of
these
problems,
but
it's
still
like
it's
still.
Caching,
I
just
find
like
the
efficiencies
with
the
caching
and
assuming
that
the
classroom
gonna
prevent
this
kind
of
like
problem
that
customer
had
in
this
particular
case.
A
But
perhaps
we
are
not
actually
solving
a
high
availability
problem
here,
because
it's
not
what
customer
needs
it.
It
might
be
something
that
they
articulate
they
need,
but
it
might
be
something
that
they
don't
need
actually,
and
perhaps
caching
could
be
enough,
but
I
I
feel
like
if
caching
would
be
enough
and
in
what
form
it
depends
on
tiny
technical
details
of
their
problem,
and
I
I've
not
seen
a
complete
description
of
the
problem
in
a
document
or
like.
A
I
know
that
we
talk
to
them
a
lot
and
that's
you
know
the
reason
why
we
do
have
this
blueprint
written.
But
it's
not
entirely
clear
to
me
what
the
problem
is
exactly
because
it
depends
on
small,
technically
the
technical
details
that
are
not
here
and
yeah.
I
just.
C
Just
on
that
point,
one
of
the
things
that
would
maybe
help
here
is:
we
could
look
at
the
registry
logs
and
get
an
idea
of
those
traffic
patterns
for
this
particular
customer.
You
know
for
for
a
week,
or
we
could
look
in
in
our
in
our
indeed,
you
know
in
other
storage
locations
and
get
more
data
than
that
or
we
could
see
sort
of
what
the
working
set
is
and
then
you
know
we
could
probably
figure
out
like
how
big
like
a
cache
would
need
to
be
and
and
then
get
a
much
better
idea.
A: Thank you for that suggestion. In my opinion, it might not give us a full picture of events like the one with scaling and the release of a new product, but it would actually give us some insights about how they modeled their infrastructure behind it.
C: Yeah, we keep the logs in Google Cloud Storage, ironically, so we could pull that and work from it. We've also got, you know, the number of bytes that we sent, so from that we can look at what the working set size would need to be for a cache to have been effective, and do some better modeling around that.
A: Yeah, that's interesting. I would still like to highlight one proposal from the blueprint that I find interesting. I'm sorry for the background noise; the kids are back. So, the blueprint describes a notification mechanism designed to notify the on-premises thing, whatever it is, either a cache or something else. It's simply a webhook being sent from GitLab to this external service that will allow it to preheat the cache, or warm it up, whatever we call it.
A
We
call
it,
and
I
think
that
it
can
actually
help
not
only
this
customer
but
other
customers
to
design
their
own
solution
if
they
can
simply
go
to
github
and
configure
a
web
web
hook.
That
will
notify
their
service
that
there's
a
new
image
pushed
or
a
new
version
of
the
same
tag.
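To sketch what a receiver of such a webhook might do, a minimal example of turning an "image pushed" notification into the pull that pre-warms a local mirror. The payload shape, field names, and registry host are all hypothetical, since the blueprint's webhook format isn't specified here:

```python
import json

def warm_command(payload: bytes) -> list[str]:
    """Turn a (hypothetical) 'image pushed' webhook payload into the pull
    command a local mirror would run to pre-heat its cache."""
    event = json.loads(payload)
    image = f"{event['registry']}/{event['repository']}:{event['tag']}"
    return ["docker", "pull", image]

# Example payload as the webhook sender might POST it (hypothetical shape):
body = json.dumps({
    "event": "tag_push",
    "registry": "registry.gitlab.example",
    "repository": "group/app",
    "tag": "v2",
}).encode()
cmd = warm_command(body)  # ["docker", "pull", "registry.gitlab.example/group/app:v2"]
```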
A: So where we are is: we can actually get the data from logs and understand better what the customer problem is, and we basically, I think, agree that building a webhook notification mechanism can help not only in this particular case, but in other cases as well.
C: Just one thing that I have seen in the past: one of the highest endpoints in terms of traffic on gitlab.com at the moment is that JWT auth endpoint, and I think we've mentioned it before, but there seems to be a one-to-one ratio between fetching a container and the auth request. If we could reduce that, or gear it down a little bit, that might make a huge difference.
C: Consider moving that elsewhere. I really have to go, but I've just got one other very subtle point, as to your point, Ali: weirdly, the JWT auth endpoint does not get routed to the GitLab API fleet. For purely technical-debt reasons, it actually goes to the web service, which is very confusing and not what you'd expect, and something that people have been meaning to fix for a very long time. It's just about how HAProxy routes things, but that's just something to note; it's very confusing.
A: Yeah, so it's a bit unfortunate that Zhao couldn't join today, but I really like this idea of checking in our logs how the customer is using the container registry. It can actually give us an answer to the question of how many images they use and how big they are in total. And then I think that the on-premises caching proxy might actually be a good solution, but we need to simplify it somehow.
E: And we'd be maintaining that in addition to the main registry endpoint.
D
I
I
have
also
one
more
suggestion,
which
is
after
exaggeration,
the
jwt
out
is
basically
so
frequent,
so
we
could
actually
calculate
pretty
much
real
sla
of
this
endpoint
in
particular,
like
taking
into
account
like
the
duration
and
like
how
many
errors
were
produced
in
the
given
period
of
time.
So
like
because,
like
as
andrew
said
like
this
is
jwt
out
which
is
going
to
the
web.
D
Freeze,
not
the
api
fleet,
but
since
this
is
also
like
very
noisy
neighbors
like
it
could
allow
us
to
estimate
if
we
would
like
run
that
endpoint
separately
to
everything
else.
What
sla
we
could
offer
like
what
was
like
the
sla
on
this
endpoint
so
far,
and
when
it
failed.
A: So let me check if we're on the same page. You suggest going to our logs to understand better what the availability number for the authentication endpoint is, because we can calculate availability based on the number of requests, successful and failed, right? This way we can actually get a concrete number for the last week, for example, or we can use the historical data to get it for the last, I don't know, month or year. Is that right?
D
I
mean
like
yes,
I
mean
like
if
you
would
somehow
be
able
like
to
find
intervals
from
let's
say
last
year,
based
on
the
logs
like
when
this
endpoint
fight.
This
could
give
us
like
very
good
in
like
hint
about
the
sli
of
this
endpoint.
Historically,
I
mean
this
particular
endpoint.
It's
still
gonna
be
affected
by
the
noisy
nightboards,
but
I'm
just
curious
about
like
this
particular
endpoint.
E
That
in
point
you're
getting
it
really,
the
authentication
ultimately
relies
on
the
database,
the
rails
database,
and
I
think
that
has
a
lower
sla
than
this
customer
desires
itself.
It's
necessarily
you
know
that
endpoint,
you
know,
regardless
of
whether
you
can
hit
it
or
bring
it
like.
I
think
it's
ultimately
relying
on
a
service
that
the
customer
has
indicated
is
not
available
enough
for
their
case.
B: The early work that we did after the incident happened helped, because we added multi-zonal clusters and we upgraded our support contract with Google. We were using a third-party vendor before, so now we have a direct support line with Google. So those things helped a little bit, I guess, hopefully, but it's unknown.
B: Yeah, exactly. They are currently using Artifactory as a pull-through local on-premise cache. They're not happy about it, because they're maintaining it, and, like Kelly mentioned, they want GitLab to be administered totally by GitLab. They have this whole core-versus-context approach that they want: no one will be better at managing GitLab than GitLab, is their architecture position.
B: Yeah, so that was one idea that was brought up, and another idea was a multi-cloud pull-through cache. They want to solve this problem, but they're looking to us, and as far as we know no one is doing that; whatever's happening is not sufficient. So they're evaluating.
A
On-Premise
discussion
proxies.
So
if
we
build
this
on-premise
discussion
proxy,
that's
definitely
going
to
be
a
huge
effort
and
give
it
to
them
to
maintain.
Are
they
going
to
be
happy
with
that
solution
or
not.
E: Yeah, I mean, they've talked about their ideal, which is, you know, an endpoint that is just always available, and yes, that's ideal. I think they've mentioned this as sort of a stop-gap until that service is up to an availability that they're comfortable with.
E
I
think
part
of
this
proposal
is
to
to
show
the
customer
and
say
this
is
a
possible
solution.
How
do
you
feel?
We've
worked,
some
of
the
technical
words
out
and
you
know
showing
some
light
on
what
this
would
really
be
like
I,
I've
been
wondering
as
sort
of
breaking
away
from
this.
A
little
bit
is
if
it
would
be
possible
for
us
to
work
with
them.
E
You
know
have
infrastructure
team
come
and
work
with
them
on
something
that
is
like
sort
of
a
best
practice
for,
like
you
know,
having
using
multiple
registries
like
not
only
gitlab.com,
but
also
something
like
docker
hub
or
key,
or
something
like
that
and
having
having
a
an
infrastructure
that
can
sort
of
adapt
to
one
of
those
endpoints
being
down,
because
I
think
that
takes
some
of
the
the
pressure
off
of
us
to
be
this
endpoint
that
can
that
never
goes
down
that
has
a
availability,
that's
higher
than
what
you
see
in
most
cloud
services.
D: I'm kind of thinking that there are really two paths. One path is that we should figure out a way to improve our service, and some of the aspects your team mentioned, they did address. But we still have, for example, this API concern that Hayley is mentioning, and even there we can improve: we checked recently that it's using the database replica, so it can be largely exempt from all the other storms around the database.
A: Particularly about the API authentication endpoint: we can build a separate service with a separate data store, and push the latest set of privileges and credentials to that separate data store and separate service. It might be, you know, more like a compute service, but it's definitely possible to build something like that. The question is: do we really need to do that, without knowing what the availability number for this endpoint is right now?
D
I
I'm
not
talking
rebuilding
the
whole
authentication
scheme.
I
I
think
that
this
is
really
like
the
last
thing
that
we
should
be
doing.
It's
like
it's.
It's
super
complex,
given
the
multitude
of
these
schemes,
but
like
like
even
like
really
like
you
mentioned
that,
like
we
had
the
troubles
with
the
database
they're
like
we
have
a
ways
to
overcome
that
for
that
endpoint.
If
you
would
really
want
to
because
it
doesn't
require
the
main
database,
it
can.
D
It
seems
to
pretty
much
work
on
the
replica
today
so
like
it
may
be
less
respectable
to
storm.
But
then
I
I
heard
that
we
also
talked
about
the
jio
for
the
github.com,
and
this
is
maybe
like
the
long
term
aspect
like
on
how
to
solve
that
problem,
because,
like
maybe,
we
would,
at
some
point,
have
like
the
sibling
gitlab
running
with
all
the
same
data
replicated
across
different
zones
with
the
database
and
everything,
and
maybe
like
the
jio,
would
be
like
the
ultimate
solution.
D: But I also like Hayley's suggestion about: can we just push this image to multiple places? Can we configure GitLab CI to be able to specify multiple places for pushing the image? Can we configure Kubernetes to try multiple places to pull that image? This seems like something where, if they bought a subscription on Docker Hub or elsewhere, they could basically have a cloud-managed service where they would have all this data always replicated across different clouds.
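The "try multiple places to pull" idea can be sketched as client-side fallback across mirrors. A real implementation would live in the container runtime's mirror configuration or a registry client; here the fetchers are injected stand-ins, and all names are illustrative:

```python
def pull_with_fallback(image: str, registries: list) -> str:
    """Try each registry in order and return the first successful result:
    a sketch of client-side redundancy across replicated registries."""
    errors = []
    for fetch in registries:
        try:
            return fetch(image)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(exc)
    raise RuntimeError(f"all {len(registries)} registries failed: {errors}")

def primary(image):
    # Stand-in for the primary registry being down.
    raise ConnectionError("primary registry unreachable")

def mirror(image):
    # Stand-in for a Docker Hub / Quay mirror holding the same image.
    return f"manifest-for-{image}@mirror"

result = pull_with_fallback("group/app:v2", [primary, mirror])
```

This only helps, as noted earlier in the discussion, if the image was actually pushed to every mirror, which is why it pairs with the "push to multiple places" half of the suggestion.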
B: Yeah, I agree with that. One change we made to the runner that was really helpful is we gave them a variable that just says "dependency proxy URL", and they're able to fill it in in their group settings. They were concerned they were going to have to go in and update all of their many thousands of developers' pipelines to point to a specific dependency proxy, but now they're able to use a group setting for that.
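For context, this is the pattern GitLab's dependency proxy variables enable in CI configuration; the job below is a sketch (the variable name is from GitLab's predefined CI variables, the image is an example):

```yaml
# .gitlab-ci.yml sketch: pull the base image through the group's dependency
# proxy instead of hitting Docker Hub directly, with no per-pipeline URL
# hard-coded into thousands of individual pipelines.
build:
  image: ${CI_DEPENDENCY_PROXY_GROUP_IMAGE_PREFIX}/alpine:latest
  script:
    - echo "built via the dependency proxy"
```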
B
So
I
think
these
kinds
of
small
changes
that
improve
not
only
what
their
experience,
but
all
of
our
customers
experiences
are
what
we
should.
They
seem
to
me.
They
seem
more
desirable
and
more
reasonable.
I
think
if
we
go
and
we
if,
if
this
solution
for
the
on
prem
proxy
and
cash
seems
complicated
and
risky
to
to.
D: Being able to pull from many places and run from many places is also something super desirable for us, because we are affected by the same problem. Our CI is going to be down if we cannot pull the image from the dev GitLab today, right? And it's a big thing for our team as well. So we are also affected today by these problems that they are facing, and I think that we really have the same approach to those problems as they have.
D
We
want
to
maintain
minimal
amount
of
the
components
to
run
the
service,
but
we
also
run
the
cloud
native
right
now
on
the
on
the
github.com,
so
we
also
have
to
pull
many
versions
of
the
github
from
from
the
container
registry
of
github
or
from
the
docker
hub.
So
we
also
have
the
same
problems
that
they
have
really
and
I
I
think
like
we
would
be
really
like
the
first
customer
to
solve
these
problems,
for
because
it's
like,
if
we
cannot
scale
up
our
application,
they,
then
our
customers
gonna
be
affected
as
well.
D
If
we
cannot
like
release
the
fix,
our
customers
gonna
be
affected
as
well.
So
I
I
think,
like
this
is
the
important
aspect
of
that
is
like
they
have
the
problem
that
we
really
have
as
well.
We
just
didn't
yet
encounter
this
to
be
a
problem
for
us.
A: Yet. So that's interesting, and I think that we should take a step back and try to understand the customer's problem again. As Andrew suggested, we can pull data from our Elasticsearch cluster to understand better how the customer is using the container registry, and then we can calculate the API availability number from the logs as well.
A
Then
I
think
it
it
would
be
great
to
actually
triangulate
the
solution
looking
at
what
we
might
need
as
a
company
and
the
first
customer
of
content
registry
and
what
other
users
and
why
their
community
might
benefit
from.
I
wonder
team,
if
actually
it's
it
would
be
possible
to
get
like
a
document
in
google
docs
or
something
that
I
will
describe
the
the
problem
that
the
customer
is
facing
a
little
bit
in
in
more
detail.
B
A
Yeah, so I think it would be extremely helpful to actually have it in a single place, so that people interested in helping with finding a solution could read it and understand the problem better. I'm always thinking that understanding the problem is a prerequisite for finding a solution, and if we could triangulate the solution with what we need and what the wider community needs, it would probably be the perfect solution.
A
Well, there are no perfect solutions, but there might actually be good-enough solutions. So I wonder if we can have all three action points done before the next meeting, because I feel like that's a very interesting problem, and having, I don't know, bi-weekly meetings to discuss this would actually help a lot as well. And then next time Zhao will perhaps be able to join as well.
B
So for me, the action item is just to have the problems that this customer encountered in one place, and I think we should have it in the main epic that we're using for the architecture. I'll just make sure it's there and I'll share it with the people on this call afterwards. That would.
A
Be perfect. Now we'll try to get data from logs so that we can better understand how they are using the container registry, and also calculate the availability number for the container registry authentication API endpoint.
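The availability number mentioned here can be computed from request logs as a simple ratio. A minimal sketch, assuming structured log entries with `path` and `status` fields (the field names and the `/jwt/auth` path are assumptions, not GitLab's actual log schema):

```python
# Sketch: compute availability of an endpoint from request log entries
# as the share of requests that did not fail with a server error (5xx).

def availability(entries, path="/jwt/auth"):
    """Return the fraction of requests to `path` that returned a
    non-5xx status, or None if there were no matching requests."""
    relevant = [e for e in entries if e["path"] == path]
    if not relevant:
        return None
    ok = sum(1 for e in relevant if e["status"] < 500)
    return ok / len(relevant)

logs = [
    {"path": "/jwt/auth", "status": 200},
    {"path": "/jwt/auth", "status": 200},
    {"path": "/jwt/auth", "status": 503},
    {"path": "/v2/", "status": 200},
]
print(availability(logs))  # 2 of 3 auth requests succeeded
```

Client errors (4xx) are counted as available in this sketch, since they mean the service answered; only 5xx responses count against availability.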
B
And
and
the
next
output
from
there
could
be
now
we
we
have
this
problem.
We
can
are
looking
at
the
data.
A
next
step
from
there
could
be
a
follow-up
meeting
with
the
customer
to
better
understand
what
they've
done
or
maybe
make
some
recommendations
on
how
to
pull
from
multiple
registries,
as
well
as
we'll
be
able
to
now
talk
through
some
of
the
ideas
that
were
brought
up
pretty
shallowly.
In
our
initial
conversation
like
oh,
we
want
a
a
cap,
an
on-prem
cache.
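The "pull from multiple registries" recommendation boils down to trying a list of endpoints in order and taking the first that answers. A hypothetical sketch (the function names and registry hostnames are illustrative, not a real client API):

```python
# Sketch: client-side redundancy across container registries.
# Try each registry in order and return the first successful result.

def pull_with_fallback(registries, image, fetch):
    """Return (registry, manifest) from the first registry that
    responds; raise RuntimeError if every registry fails."""
    errors = []
    for registry in registries:
        try:
            return registry, fetch(registry, image)
        except Exception as exc:  # e.g. connection errors, timeouts
            errors.append((registry, exc))
    raise RuntimeError(f"all registries failed for {image}: {errors}")

# Usage sketch: the primary registry is down, the mirror answers.
def fake_fetch(registry, image):
    if registry == "registry.example.com":
        raise ConnectionError("registry unavailable")
    return {"image": image, "from": registry}

source, manifest = pull_with_fallback(
    ["registry.example.com", "mirror.example.com"],
    "myapp:1.0",
    fake_fetch,
)
print(source)  # the mirror, since the primary raised
```

In practice, a pull-through cache or a container runtime configured with registry mirrors does this fallback for you, rather than application code.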
B
A
Our discussions on the merge request can actually be something that is interesting to them as well, so yeah, our collaboration on the blueprint might actually be good for them to read and understand that we actually explored all these ideas thoroughly. I totally agree.
B
Okay
and
and
then
we'll
have
some
recommendations
that
are
in
line
with
our
goals,
but
you
know,
like
you
all
mentioned,
we
encountered
these
problems
that
well
so
increasing
the
reliability
and
redundancy
is
a
good
thing,
but
we
should
do
things
that
make
sense
that
are
the
boring
solutions
and
that
are
iterative
and
not
take
on
big
scary
projects.
To
start.
B
That makes sense. Yeah, I'm on board with that too. I like the idea of improving the auth API, whatever we could do around that, and anything we could do around caching. I know we said caching won't fix everything, but if there's something we could do there, and some infrastructure recommendations, that would be great.
D
I
think
that,
like
the
right
way
for
this
would
be
to
step
back
and
describe
the
problem
in
as
many
liters
as
possible
and
and
then
like
figure
out
exactly
which
parts
of
the
problem
we
want
to
solve
because,
like
I
think
we
we
mentioned
a
few
potential
improvements
to
the
whole,
but,
like
whatever
is
more
important
right
now.
It
really
depends
on
like
on
the
on
the
description
of
the
problem
and
now,
like
I
think,
for
the
user.
D
Looking at the blueprint, it seems that we are going to work on that particular solution to describe some subset of the problems. But I think the general idea behind the blueprints is a high-level definition of the problem that we are solving, maybe with some hints at different approaches to how it can be solved; the blueprint is not there to describe the actual solution. I think the epics and issues are, but not the blueprints.
A
Yeah,
so
what
I,
the
my
strategy
behind
building
blueprints,
was
always
to
describe
the
problem
and
the
vision
that
will
help
us
like
to
solve
that
problem.
Like
architecture
is
always
a
hypothesis
and
you
iterate
on
architecture
to
in
order
to
actually
prove
your
hypothesis
and
then,
if
something
doesn't
work,
you
adjust
your
trajectory
so
yeah,
that's
just
a
random
thought.
Okay,
so
we
do
have
a
production
points.
A
I
will
create
issues
and
then
I
will
schedule
the
next
call
in
presumably
something
like
a
week
or
two
weeks
and
yeah.
I
just
wanted
to
thank
everyone
for
waking
up
early
and
joining
us
thanks
thanks
camille
for
for
joining
as
well
and
again
it
was
great
that
andrew
joined.
A
So
thank
you
very
much
and
see
you
next
time
have
a
great
day,
bye
and.