From YouTube: 20200522: Speed Run - Gitaly backups
Description
Share your thoughts here https://gitlab.com/groups/gitlab-org/-/epics/2094
Hi, this is James Ramsay, group product manager here at GitLab, and I'm looking into improving the way we allow you to back up your Git repository data on GitLab. I've got some ideas from reading a couple of issues and feedback from customers, but I thought the best place to start exploring was to actually try and do a backup. So I've created a GitLab instance here and imported a handful of large, popular repositories, and I'm going to try and back them up.
I've gone to the documentation, and I want to use object storage to store the backups, so I've gone and configured Google Cloud Storage, at least I think I have. I did that before the recording so you didn't see my tokens, but this is the bucket I'm aiming to put the data in, and as you can see, there's nothing in it yet. So let's go to the docs and see what they have to say. Okay.
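Roughly speaking, the docs have you point the backup task at the bucket from /etc/gitlab/gitlab.rb; something along these lines, where the keys and bucket name are placeholders rather than the real values from this instance:

    # /etc/gitlab/gitlab.rb -- upload backup archives to Google Cloud Storage.
    # Placeholder credentials; run `gitlab-ctl reconfigure` after editing.
    gitlab_rails['backup_upload_connection'] = {
      'provider' => 'Google',
      'google_storage_access_key_id' => 'ACCESS_KEY',
      'google_storage_secret_access_key' => 'SECRET_KEY'
    }
    gitlab_rails['backup_upload_remote_directory'] = 'example-backup-bucket'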
This node was configured with a Gitaly Cluster, so it's Praefect and three Gitaly nodes, so those large repositories should actually be living on these three nodes. It looks like maybe they've got eight or so gigabytes free, I suppose. Okay.
So here you can see it dumping some wikis.
It's not showing the screen again. Okay, so I just triggered the import back there. If we go to Projects, we'll see what has been imported; that should be quite a lot larger and will have slowed things down. But in the meantime, let's take a look at these options and see what we can exclude from the backup, so we can just focus on repositories, because that's really what I'm interested in: so skip DB and uploads.
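Concretely, a repositories-focused run on an Omnibus installation would be something like this (the SKIP list can name other components too, such as artifacts or LFS):

    # Skip the database and uploads so the backup is dominated by repository
    # data, then let it upload to the bucket configured above.
    sudo gitlab-backup create SKIP=db,uploads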
Okay, so I already configured this one beforehand. So yes, what we're trying to work out here is how we might make these backups faster. Once a repository becomes a certain size, or if you've got thousands of repositories, even if they only take a minute each, or even 30 seconds each, your backup quickly takes hours and hours and hours. And it's not unforeseeable that you would have a lot of large projects, be they forks of open-source projects or huge internal projects that you've developed over many years.
That matters if you're trying to restore from backup, because then you'll find that your Git repository is corrupted and it's going to need manual repair, which is going to be quite stressful. When you're trying to recover from backup, there's generally enough stress involved without added data corruption concerns. So we want to make sure that the backups are consistent.
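That's not something the backup task walks you through end to end, but as a rough manual sanity check, if you have a bundle for a repository you can verify it and fsck a scratch clone of it; the paths here are just illustrative:

    # Check that the bundle is self-consistent and that a clone of it passes
    # Git's own integrity check.
    git bundle verify /backups/myproject.bundle
    git clone /backups/myproject.bundle /tmp/restore-check
    git -C /tmp/restore-check fsck --full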
Let's see what's coming through. It's slowly reporting 3.4 gigabytes used, which seems... ah, this one is the secondary, so it won't have the disk utilization updated until this is complete. So let's take a look at this one.
So I guess one of the ideas would be that we could keep track of these backups, rather than treating them as specific timestamped repository backups, by instead calculating the checksum of the bundle. So when we generate... let me rewind. With Git repositories we have a method, an internal method, for calculating a checksum for the repo, so we can run that and work out if two servers have the exact same state of that repository.
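As a rough illustration of that idea (this is a stand-in fingerprint based on the refs, not necessarily the exact algorithm Gitaly uses internally, and the path is made up):

    # Fingerprint a repository by hashing its refs; if no ref has moved,
    # the reachable history hasn't changed.
    cd /var/opt/gitlab/git-data/repositories/example/project.git
    git show-ref --head | sha256sum | awk '{print $1}'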
So if we stored that checksum with the bundle somehow, then we could compare the backup we last took with the current checksum of the repo. And so, if I'm taking another backup and I just took one 12 hours ago and the checksum hasn't changed, then maybe I don't need to back that repo up again. So, depending on the level of churn, it could cut the number of backup jobs down to a fraction of what they were; maybe it's 5% or 1%.
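A hypothetical sketch of that skip-if-unchanged flow, assuming the previous checksum is stored next to the bundle in the bucket (the bucket name, object layout, and paths here are invented for illustration):

    # Compare the current fingerprint with the one stored alongside the last
    # bundle; only re-bundle and upload when something has changed.
    repo=/var/opt/gitlab/git-data/repositories/example/project.git
    bucket=gs://example-backup-bucket/example/project

    current=$(git -C "$repo" show-ref --head | sha256sum | awk '{print $1}')
    previous=$(gsutil cat "$bucket/checksum" 2>/dev/null || true)

    if [ "$current" = "$previous" ]; then
      echo "checksum unchanged, skipping backup for this repository"
    else
      git -C "$repo" bundle create /tmp/project.bundle --all
      gsutil cp /tmp/project.bundle "$bucket/project.bundle"
      printf '%s\n' "$current" | gsutil cp - "$bucket/checksum"
    fi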
It's likely to be a significant saving. And then, even further than that, we could iterate on taking even smaller snapshots: rather than taking a full snapshot every time the repo changes, we could look at taking an incremental backup. So those are some areas that we're exploring, and we'll need to work out how this plays into not only the backup process but also how we restore.
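To illustrate the incremental idea, one way to do it with plain Git is to record the refs at the time of the last full bundle and exclude everything reachable from them in the next bundle; the file names here are only for illustration:

    # Full backup: bundle everything and record where every ref pointed.
    repo=/var/opt/gitlab/git-data/repositories/example/project.git
    git -C "$repo" bundle create /tmp/project-full.bundle --all
    git -C "$repo" for-each-ref --format='%(objectname)' > /tmp/refs-at-full

    # Later, incremental backup: only pack history that is not reachable
    # from the refs captured at the previous backup.
    exclusions=$(sed 's/^/^/' /tmp/refs-at-full)
    git -C "$repo" bundle create /tmp/project-incr.bundle $exclusions --all

Restoring would then mean fetching from the full bundle first and layering each incremental bundle on top, which is part of what we'd need to design on the restore side.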
We'd also need to work out how the data would be structured in an object storage environment, where we'd presumably typically store these kinds of backups, but then again, these could also just be on any file path.
So those are some ideas that we're evaluating. Yeah, it looks like things are still going quite slowly, even for an instance with only a handful of large repositories.
This thing was really slow. All those repositories were less than a gigabyte, and it's really not that uncommon for large organizations, with historic projects that have been going for 10-plus years, to have much larger projects than this, which would take even longer to back up. So it's important that we solve these problems.