► From YouTube: 20200415: High availability Gitaly demo
Description
A: Cool, it's 4:30 my time, so I'm going to get started here. I've done a little bit of prep work so we don't have to go through everything here. I've set up the Praefect database already, so I'm going to skip that. I haven't actually run reconfigure, but I did want to show one difference here. What I'm doing here is I added a load balancer in for Praefect, so just keep in mind that in this demo we're going to demo multiple Praefects in front of a load balancer. I've actually tested this, so this will be exciting.
Okay, so I'm just going to broadcast to make sure the configurations are all the same; this is basically just the same thing. There are two new configuration options I'm going to flip on now. This is for using multiple Praefects, so I'm going to change that to true, and I'm going to change that to 'sql'.
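(The two options being toggled aren't read out on screen; as a sketch, the Praefect failover settings from the documentation of that era would look like this in /etc/gitlab/gitlab.rb on each Praefect node, followed by a reconfigure:)

    # /etc/gitlab/gitlab.rb on each Praefect node (a sketch; setting names
    # taken from the Praefect docs, not shown verbatim in the video)
    praefect['failover_enabled'] = true
    praefect['failover_election_strategy'] = 'sql'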
All right, so, okay, the docs already told me failover, okay, so we've already got these. I'm surprised that this doesn't have a Gitaly section, though; that's a little weird, okay. So it's basically saying go back to the Praefect... I think it'll, yeah, so I guess I could combine this step, but do we really need to split them up, right?
B: So, the last time I read the doc, the reason it was split up is that after you configure, you basically verify that you can... so if we'd done them in another order, you wouldn't be able to do the verification step associated with the Praefect changes you just made, all right.
A: Okay, let's reconfigure that, and let's go back down there and see; there's a check in there too, right. So this is back to the Praefect docs, okay, so I'm just going to wait for that reconfigure to finish and do this again. Okay, that was fast, all right, the SQL side is happy, good. Gitaly nodes... actually, this is... I got ahead of myself because I'm confident. There's a lot of...
B: You could copy that and paste it somewhere... and I'll put it in the issue; like, I'll just paste this whole thing in.
A: Dump that in there, okay. Sorry for flooding the channel there. Okay, so that's it, so that's good: dial-nodes worked, failover is enabled, SQL is enabled, and I've already reconfigured. The question I have is whether the database migrations are up, and... I'm just going to show this off, because this is a relatively new thing.
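(For reference, the dial-nodes verification mentioned above can be run like this; a sketch assuming the default Omnibus paths:)

    # Run on a Praefect node; dials every Gitaly backend in the config
    sudo /opt/gitlab/embedded/bin/praefect \
      -config /var/opt/gitlab/praefect/config.toml dial-nodes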
A
Don't
really
need
to
run
it
all
them,
but
what
I
want
to
know
is
so
I
thought
there
was
a.
There
was
a
Omnibus
merge
request
to
make
this
automatic
I
may
not
have
been
merged
in
this
one,
so
I'm
gonna
not
broadcast
it
to
all
them
and
just
migrate.
Is
it
in
that?
It's
not
in
our
documentation
either
in
a
migrate
right?
...to run it with reconfigure automatically. So I guess I'm just not sure if I just didn't run it, or the merge request actually isn't in here. So I'm just going to run it for now, but we should check that. All right, so now that I have that, I have the status again and we are good. Okay, so I like that function, and thanks for merging that, Paul, because I was constantly getting annoyed at having to log into the database to figure out what was in it.
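(A sketch of the migration run and the status function being praised here, assuming the praefect subcommand names from the docs of that era:)

    # Apply Praefect's database migrations by hand (one Praefect node is enough)
    sudo /opt/gitlab/embedded/bin/praefect \
      -config /var/opt/gitlab/praefect/config.toml sql-migrate

    # Show which migrations have been applied, without logging into the database
    sudo /opt/gitlab/embedded/bin/praefect \
      -config /var/opt/gitlab/praefect/config.toml sql-migrate-status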
A: So the configuration looks good: the election strategy is there, failover is enabled, okay. So let's see... this is okay, so this is all the configuration. I've actually run this already, but I haven't actually... The one thing I'll note here is that, last time I did this demo, we weren't sure what to do here. This is actually the load balancer IP, so I think I had to run a reconfigure here.
Because I don't know if I actually saved it, but that IP is basically what you see here: 10.1 56. So it's cool, it actually shows me that Praefect is up, because before we started this it was actually unhealthy. So it is a TCP load balancer, so it's actually pinging port 2305, which is the gRPC port, and basically saying it's going to round-robin between those. So, cool, good; that reconfigured fine. Let's go back to the documentation.
A: Right, so the check worked great first try: GitLab can reach Praefect. So does it... oh wait, this will be interesting, because I've actually never run this with the load balancer in front. What's going to happen? We should get... it's faster, it's not as satisfying as the other ones, all right. So while that is cooking there, I... oh no, failed to connect. Praefect's okay, but... Gitaly, oh, Gitaly is...
B: I have an idea, yeah, yeah, yeah. So, well, in the docs I made the default the internal IP, I guess, so that you could move repositories between shards. So you have to actually set the listen address, because right now it's just localhost, but it's trying to find it on the internal IP.

A: I see.
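(A minimal sketch of the Gitaly-side fix being described, assuming the default Gitaly port, in /etc/gitlab/gitlab.rb on each Gitaly node:)

    # /etc/gitlab/gitlab.rb on each Gitaly node (a sketch): listen on all
    # interfaces instead of localhost so Praefect can reach Gitaly over the
    # internal network (8075 is the default Gitaly port)
    gitaly['listen_addr'] = '0.0.0.0:8075'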
A: So I'm just going to go ahead with this change, kick the Praefect now, and we're good, all right. So let's create a new one.
This will be... actually, while this is happening, let's take a look at Grafana, so I can at least see some metrics, if there are any. And I actually haven't set the password... is this admin/admin as the default? Yep. Now I have to set the root... oh.
This is my failed attempt to use Cloud SQL.
The replication latency... but we don't have the new replication delay metrics. Actually, Patrick did that one, right? Yeah, sorry, yeah, that one is merged, so we could add it. It's called like gitaly_praefect_ something for replication delay, so just change the 'latency' to 'delay'.
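(The full metric name isn't spelled out in the audio; as a sketch, a Grafana panel for the renamed metric might use a query like the following, assuming it's exported as a histogram under the gitaly_praefect_ prefix:)

    # Hypothetical PromQL: 95th percentile of Praefect replication delay
    histogram_quantile(0.95,
      rate(gitaly_praefect_replication_delay_bucket[5m]))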
B: Like, we've only really looked at testing, in all these demos, node up / node down, right? We haven't considered other kinds of failure modes, like a degradation, where one node becomes much slower than the others, right? I'm presuming that the load balancer doesn't do anything smart around that, it just...
C: It just... it just operates on the primary; everything else is async.
B: Let's say one of the Praefects notices degradation... I mean, yeah, on the Gitaly side, within the shard, from Praefect to Gitaly; but Praefect's not doing anything smart there. Then the other thing, that I think I created an issue about the other week, is: what happens if, say, the health check succeeds, but it doesn't... so, like, imagine the disk is full. What happens if we were able to fill the disk, or basically prevent the git user on the primary from writing?
Like, would the failover still happen? So, I don't know if we can do something to basically change the user permissions on a Gitaly node, the current primary, so that that primary no longer had permission to write to disk. I imagine the health check would still succeed, because Gitaly is still up, right; it just wouldn't have write permissions. So, like, right, there's a whole range of failure modes that we don't consider.
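(One hypothetical way to stage the experiment being proposed, assuming the default Omnibus repository path:)

    # On the current primary Gitaly node: leave the process healthy but make
    # the repository storage unwritable (a sketch, not a tested procedure)
    sudo chmod -R a-w /var/opt/gitlab/git-data/repositories

    # Undo after the experiment
    sudo chmod -R u+w /var/opt/gitlab/git-data/repositories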
C: We can't really do that with Praefect right now, because you'd have to have something that correlates, you know, which storages of the backends are part of this virtual storage. How do you do that in a really easy-to-manage way? Do you want people creating dashboards manually for every virtual storage and figuring out which Gitalys belong to it, or should that be automated to a certain extent?
B: The proposal I made in the issue, when I was thinking about this problem the other week, was: rather than using a health check, we should be logging failures of operations, so, like, error rates. And so, if the error rate over a very short time span, of like two seconds, on an active node with regular activity... like, if any read or write operation fails like five times in a row, mark the node out, right? Don't rely on a health check, because health checks only happen every...
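(This is just a proposal in the discussion, not existing Praefect behavior; a minimal Go sketch of the "N consecutive failures marks the node out" idea might look like this, with the threshold of five taken from the conversation:)

    // A toy sketch of the consecutive-failure idea discussed above
    // (not Praefect's actual code).
    package main

    import "fmt"

    type nodeBreaker struct {
        consecutiveFailures int
        threshold           int
    }

    // report records the outcome of a read/write operation and returns
    // whether the node should still be considered healthy.
    func (b *nodeBreaker) report(err error) bool {
        if err != nil {
            b.consecutiveFailures++
        } else {
            b.consecutiveFailures = 0
        }
        return b.consecutiveFailures < b.threshold
    }

    func main() {
        b := &nodeBreaker{threshold: 5}
        for i := 0; i < 5; i++ {
            healthy := b.report(fmt.Errorf("write failed"))
            fmt.Println("healthy:", healthy) // false on the 5th failure
        }
    }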
A: I did create an issue about that too, because right now Praefect doesn't consider... I mean, we're just electing whatever node is responding to health checks, right, which is not... like, what happens when a node falls behind, right? What happens if Gitaly 2 needs ten seconds to catch up and Gitaly 3 is up to date? Really, you want to go to 3, not 2, right? So there's all sorts of decisions we need to make about that.
B: Seems to me like the replication delay metric might also be artificially high... oh, that would explain those numbers, because the first... yeah, the replication delay, I think it's incorrectly high, because, essentially, if we put 10 jobs on the queue for the same repo, if we process one, all 10 jobs will probably be up to date, because we've mirrored the whole thing, so all 10 jobs are then no-ops. And so if you've got, like, a higher sustained write volume, right...
It'd be super interesting to know what the write operations per second are on a per-repository basis. Like, we know what they are at an instance level; well, I could get the Gitaly node level, but we don't know, like, for this specific... what's the cross-section like for a single repo? What's the 50th percentile for write ops per second by repo, and then the 99th percentile? Because really it's that 99th percentile of write operations per second at a repository level that's really, like, the limiting factor.
A: No, actually, no, no, you route it to a Gitaly node... sorry, because you're passing a project ID, okay, so it's... let's see, Gitaly...
B: But yeah, you're passing the project ID.
C: ...to replicate anything that's inconsistent with the current primary, or you can point it at whatever node you want. So there's actually some docs on that on the Praefect doc page; it's, I think, closer to the bottom. With multiple Praefects, since they all have the same configuration, you just need to do it on one of them.
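(This is describing the praefect reconcile subcommand; a sketch with hypothetical storage names, flags per the docs of this era, run on any one Praefect node:)

    # Reconcile repositories on gitaly-2 (hypothetical name) that are
    # inconsistent with the reference storage gitaly-1
    sudo /opt/gitlab/embedded/bin/praefect \
      -config /var/opt/gitlab/praefect/config.toml \
      reconcile -virtual default -reference gitaly-1 -target gitaly-2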
C: Now it is, yeah. The first step was making it manual, and we were talking about, when a node rejoins after a certain amount of time, we would run this as a sanity check, to make sure that the user didn't, like, remove the disk and swap it out, or, for example, there's a disk failure. The problem with that is you get into problems where you're defining arbitrary amounts of time to say that someone could have swapped the disk, or there could have been damage or something in the meantime.
C
We're
also
talking
about
putting
some
kind
of
token
on
the
fall
I
guess
we
have
a
filesystem
UUID
that
we
could
use,
and
if
that
you
ideas
change,
then
that
means
it's
not
the
same
file
system
anymore.
So
that
was
another
thing.
I
don't
know.
If
we
have
an
issue
open
for
that
right
now,
cuz
we
had
that
MVC
issue,
I'm,
not
sure
if
we
follow
it
up
off
the
chart.
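(For reference, one way to read the filesystem UUID being discussed; a sketch that assumes the repositories sit on their own mount at the default Omnibus path:)

    # Print the UUID of the filesystem backing the Gitaly storage
    findmnt -no UUID --target /var/opt/gitlab/git-data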
B: You'd want it to just be like, oh yeah, you would just be an inconsistent node; like, the next replication job would be like, okay, can't make quorum, let's resync it. And then the only concern is failover: how do you handle failover, like, until the node is fully in sync? You can't really fail over; that's, I guess, like, the problem. Yeah, see, it's stuck there.
But... "failed to persist replication job"? Oh, I wonder if it's having issues talking to the database.
B: Yeah, so I think... I'd really like it if anyone's got any other ideas on the kinds of failure modes that could occur that are not just, like, the server going dark. My guess would be it's, like, commissioning problems: say you incorrectly configured one of the Gitaly nodes, and, like, it was working, and then you push some config change to your fleet, and you're incrementally rolling out a config change, and some percentage of your nodes were, like, partially operational but essentially unusable, so, like, the data directory was...
B
You
commissioned
it
wrong
or
storage
was
full
trying
to
think
of
other
ones,
like
the
other
one
that
comes
to
mind
that
that
that
who
could
be
like
some
sort
of
a
latency
one,
we're
like
one
node
is
getting
really
slow
and
unresponsive.
I
mean
I.
Think
our
timeouts
are
quite
high.
So
like
we
wouldn't
catch
an
error
because
it
would
just
be
slow.
But
then,
if
you
like
got
a
different
shard,
it
would
be
fast
and
I.
...think that's the most interesting edge case I'm worried about, because somewhere in our, you know, interactions between the Praefects and the database, there could be some kind of bottleneck that isn't exposed until one of the nodes is really slow, because we're kind of doing distributed locking of this data set, and what happens when one of them is just really slow? Maybe we want to have some kind of slow mode where we, like, inject sleeps and stuff into various actions.
A: What we could do is, like, create a matrix or a spreadsheet of, like, failure modes, right? Like, this is what you do in the auto industry or some safety-critical thing: you write down the possible things that can go wrong and level the risk, and then maybe what you're going to do about it, kind of thing, right? So, I'd have to figure out what the term is, but it's basically a failure analysis, and we need to do something like that here, yeah.