From YouTube: Git Internals: a Database Perspective - Git Merge 2022
Description
Presented by Derrick Stolee
The inner workings of Git's object database can be a mystery to most users. When you incorporate a database into your infrastructure, you are expected to learn about database internals such as table indexes, query plans, and sharding. Similar features exist in Git, and learning about them can advance your use of Git to the next level.
About Git Merge:
Git Merge is dedicated to amplifying new voices in the Git community and showcasing the most thought-provoking projects from developers, maintainers, and teams around the world. Git Merge 2022 took place at Morgan Manufacturing in Chicago, IL on September 14th and 15th.
I am so excited to be surrounded by like-minded folks such as you today. I want to talk about an idea, and this idea should not be controversial or even surprising to this audience, but I hope that it gives us a framework and a vocabulary to use as you leave this bubble of Git superfans and go back to your own organizations, spreading the good word. The idea is this: Git is the distributed database at the core of your engineering system.
When you think about it, Git is the center of our collaboration infrastructure. Not only does it let multiple developers work concurrently on the same repository, but it also links to our build and test infrastructure. It dictates which versions are deployed or released to customers, and stakeholders tend to watch the repository to measure progress. As a parallel, your application database is the core of your application infrastructure.
The application database is used to store all the data that you are manipulating and serving through all of your services. Your background jobs process data from the database asynchronously from user requests. Your infrastructure probably has some way of doing database backups and failover remediation, and don't forget that your database health is monitored by support and SRE.
So we've already seen three talks today about engineering systems investing heavily in Git to make it work exactly with their needs. These talks from Uber, Twitter, and Canva show what's possible when you make that kind of investment. But, you know, not every engineering system is large enough to really merit that kind of investment; many organizations rely on the Git client out of the box, or on the features of their Git host of choice.
So
here's
my
personal
pairing
of
Concepts
from
application
databases
and
how
they
pair
with
Git
Concepts
at
their
core
application.
Databases,
store
tables
of
data,
get
stores
objects
in
its
Object
Store.
In
order
to
access
or
manipulate
data.
Databases
have
query
languages
gets
crew.
Language
is
its
command
line
interface.
A
A
Now, we can think about Git as a decentralized version control system, so it's sort of like a distributed database. Distributed databases need to have mechanisms to synchronize as concurrent requests come in and manipulate data. Git is disconnected by default, but we still need to be able to synchronize across repositories on user demand through fetches and pushes. And finally, as application databases grow beyond the limits of a single node, they turn to sharding.
This is what I mean by giving you a framework and a vocabulary: hopefully you can use these concepts when you go back to your organizations and talk to your fellow developers. By starting with this common understanding of application databases, this might help bridge the gap to those Git concepts.
The reference name is the primary key, and these are human-created names that give us pointers into the object store. In this example, we have the refs/tags/v2.37.0 ref pointing to an object in the object store. If we load that object's data, we see an annotated tag. Annotated tags have a human-written message, but they also point to another object in the object store using an object ID. That object, in this case, happens to be a commit. Commits have commit messages, and they link to their parents.

But specifically, let's look at this object ID for its root tree. For its contents, we have a tree that contains many tree entries, which correspond to what file or path we see at these different names. So specifically, look at the README.md entry and find that object ID, and when we look at its contents, we see a blob object, which stores the contents of that file.
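[Editor's note: the chain described here can be walked by hand with `git cat-file`. Below is a minimal sketch using a throwaway repository; the tag name v2.37.0 mirrors the example, and the file contents are invented for illustration.]

```shell
# Walk ref -> tag -> commit -> tree -> blob, as in the example above.
set -e
work=$(mktemp -d) && cd "$work"
git init -q repo && cd repo
git config user.email you@example.com && git config user.name You
echo hello > README.md
git add README.md && git commit -qm "Initial commit"
git tag -a v2.37.0 -m "Example annotated tag"

tag_oid=$(git rev-parse refs/tags/v2.37.0)      # the ref is the primary key
git cat-file -t "$tag_oid"                      # prints: tag
commit_oid=$(git cat-file -p "$tag_oid" | sed -n 's/^object //p')
git cat-file -t "$commit_oid"                   # prints: commit
tree_oid=$(git cat-file -p "$commit_oid" | sed -n 's/^tree //p')
git cat-file -t "$tree_oid"                     # prints: tree
blob_oid=$(git cat-file -p "$tree_oid" | awk '$4 == "README.md" {print $3}')
git cat-file -p "$blob_oid"                     # prints: hello
```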
I'll always keep the commit history there at the top, followed by a row of root trees, and then, as you scan down through the entries of those root trees and get deeper down, we see that a lot of things are actually shared. Even though each commit is a snapshot of the worktree at a given point in time, two commits actually share a lot of objects in common. This Merkle tree representation is the first way that Git saves on the size of your object store, even as users are making changes.
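[Editor's note: a quick way to see this sharing for yourself, sketched with a throwaway repository and invented paths.]

```shell
# Two commits: the root trees differ, but the untouched subtree is shared.
set -e
work=$(mktemp -d) && cd "$work"
git init -q repo && cd repo
git config user.email you@example.com && git config user.name You
mkdir docs src
echo guide > docs/guide.md
echo main > src/main.c
git add . && git commit -qm "First snapshot"
echo changed > src/main.c
git commit -qam "Change src only"

git rev-parse "HEAD^{tree}" "HEAD~1^{tree}"   # two different root tree IDs
git rev-parse HEAD:docs HEAD~1:docs           # the same tree ID, printed twice
```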
Here I've highlighted objects that are common across multiple commits, based on where they appear. At the top we have all of our root trees, which correspond to the base of the worktree across all these points in time, and you can also think about these three blobs at the bottom: maybe they correspond to the same source code file.
A
Git
can
expect
that
these
objects
will
actually
share
a
lot
of
data
in
common
because,
as
software
developers,
we
rarely
change
a
huge
amount
of
the
code
at
a
time
instead
making
very
calculated
small
changes
when
we
have
that
kind
of
data
get
can
use.
An
extra
type
of
compression
called
Delta
compression.
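[Editor's note: a hedged sketch of delta compression in action. In a throwaway repository, two nearly identical versions of a file are packed together; `git verify-pack -v` should then report at least one delta chain, though the exact packing decisions are up to Git.]

```shell
# Two similar blobs packed together; verify-pack reports the delta chains.
set -e
work=$(mktemp -d) && cd "$work"
git init -q repo && cd repo
git config user.email you@example.com && git config user.name You
seq 1 500 > data.txt                         # a sizable file
git add . && git commit -qm "Base version"
{ seq 1 499; echo five-hundred; } > data.txt # a small, calculated change
git commit -qam "Small change"
git repack -adq                              # pack everything together

idx=$(ls .git/objects/pack/*.idx)
git verify-pack -v "$idx" | grep "chain length"
```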
In this way, we've reconstructed the decompressed object, but we stored those two objects using significantly less space than if we had stored them both fully decompressed. Now, I mentioned that loose objects can't take advantage of delta compression; instead, we need a different format, and that's where pack files come in. Pack files essentially take all of the object contents, either in base form or in delta form, and concatenate them into a list; they're all packed together.
This is a very efficient way to store the data, but if I come in with an object ID whose contents I want, I can't just parse the pack file and expect that to run quickly; I'd have to scan every object and rehash it to see if it matches. Instead, Git has a custom query index called a pack index. It's a .idx file that matches the .pack file. The first thing Git does when it has this input object ID is use its leading byte to jump, via the index's fanout table, to a small range of the sorted object ID list.
Within that range, we can do binary search to find the exact object ID we're looking for, and once we have that position inside the sorted list, it corresponds to another position inside a list of offsets, and that offset provides us the original position of the object in the pack file. So we've found our object; great, now what do we do with it? Well, the initial segment of that object's data includes a type and a length, which lets us know how much data of the pack file corresponds to this object.
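[Editor's note: a small sketch showing the .pack/.idx pairing on disk; the fast `git cat-file` lookups described here go through that index. Throwaway repository, invented file name.]

```shell
# Every .pack on disk is paired with a .idx that makes lookups by ID fast.
set -e
work=$(mktemp -d) && cd "$work"
git init -q repo && cd repo
git config user.email you@example.com && git config user.name You
echo hello > file.txt
git add . && git commit -qm "Initial commit"
git repack -adq

ls .git/objects/pack/          # a .pack file and its matching .idx
oid=$(git rev-parse HEAD)
git cat-file -s "$oid"         # type/size lookup served via the pack index
```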
If the object is a delta, the contents of that object include an offset value to a base object earlier in the pack file. Git can then take those two objects and do delta decompression to reconstruct the full object contents that match the input object ID. Git does this thousands of times as it's running your commands, so this is a very fast operation that happens repeatedly.
The first thing that happens is the client asks the server for a list of references, and the server provides a list of references and their object IDs at its current point in time. The client scans these references and says, "these are the ones that are important to me," and then also notices, "these are the objects that aren't in my repository." From this point on, though, all the communication will take place via object ID, in case the server has its references move.
The server can now infer that the client actually has all of the objects reachable from that commit, giving this region of objects known to be on the client. Therefore, what we need to do is find the objects that are reachable from the wants but not reachable from the haves in order to satisfy this fetch request; that gives us this region of objects.
So in this way, even though the client may have sent a small list of haves and wants, that may correspond to a very large set of objects known to the server. Now that the server has figured out what the client needs, it can take those object contents, concatenate them together into a pack file, and send that pack file over the wire. Again, this pack file can use full objects, and it can use offset deltas to previous objects in the pack. But we can also use an additional type of compression that's really helpful.
In this case, we can use a thin pack, which adds reference deltas: instead of pointing within the same pack file to a base object, a reference delta can point to an object via object ID, and that object is expected to exist on the client based on that list of haves. This gives us even further compression than we would have had before.
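[Editor's note: the haves/wants exchange can be observed end to end with two throwaway local repositories standing in for client and server; all names below are invented.]

```shell
# A fetch negotiates haves/wants, then transfers only the missing objects.
set -e
work=$(mktemp -d) && cd "$work"
git init -q server && cd server
git config user.email you@example.com && git config user.name You
echo one > f && git add f && git commit -qm "c1"
cd .. && git clone -q server client
cd server && echo two > f && git commit -qam "c2"
new=$(git rev-parse HEAD)        # a commit the client does not have yet
cd ../client
git fetch -q origin              # client sends haves/wants, receives a pack
git cat-file -t "$new"           # prints: commit -- the object arrived
```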
So you might be thinking: it's great that Git has my back and is doing all these complicated things under the hood to make my Git commands fast, but what can I do about that? Well, I'm here to say that you are in control of your repository. You determine its shape, and you can influence the norms of your organization. I'm going to get into some really big-picture items, but first I want to give you a couple of quick tips that you can use to take advantage of some things we've already talked about.
The first is that you should run `git maintenance start` in all of your favorite repositories. This will start running some background maintenance, including hourly fetches to all of your remotes, which makes your foreground fetches a lot faster; in fact, each of those fetches will be fast because it's getting a smaller set of objects. It also repacks your repository's object store incrementally every night, making sure that you are saving data on disk while still having fast lookups for your Git commands.
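[Editor's note: `git maintenance start` registers these tasks with your system scheduler. A sketch of invoking two of the same tasks directly, in a throwaway repository so nothing is actually scheduled; task names assume a reasonably recent Git client.]

```shell
# Run two of the maintenance tasks by hand instead of scheduling them.
set -e
work=$(mktemp -d) && cd "$work"
git init -q repo && cd repo
git config user.email you@example.com && git config user.name You
echo hi > f && git add f && git commit -qm "c1"

git maintenance run --task=commit-graph   # write the commit-graph file
git maintenance run --task=gc             # repack and prune
git rev-parse HEAD                        # repository is intact afterwards
```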
The other thing I want to mention is that you should use good repository hygiene. You really want to take advantage of delta compression whenever possible. The good news is that objects that don't compress well also don't really diff well or merge well, and so they don't present as reviewable changes in your pull requests. So why are they in your source control system?
Now, don't take my word for it, because you saw Emily's talk earlier today about submodules, so I'll refer all questions to her as the expert in that space. The one thing I can say is that it's difficult to move into a model of submodules if you didn't start out that way; it's hard to carve out pieces of your repository that could be treated as independent submodules. So it'd be nice if we could do something where we didn't need to modify the worktree, but we could still do some sort of sharding.
So let's take a look at this repository and imagine it has a huge commit history, but we're going to focus on a single tip at the moment. To make a time-based shard, we can archive this repository and stop writing to it, but create a new repository with a single root commit whose root tree is identical to the root tree of the previous tip. In this way, they would have the same checkout. The only thing is: you've lost all the commit history.
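[Editor's note: a sketch of the time-based shard idea using `git commit-tree`. For simplicity the "new shard" is a second branch in the same throwaway repository; the branch and message names are invented.]

```shell
# Start a "new shard": one root commit reusing the old tip's root tree.
set -e
work=$(mktemp -d) && cd "$work"
git init -q repo && cd repo
git config user.email you@example.com && git config user.name You
echo one > f && git add f && git commit -qm "c1"
echo two > f && git commit -qam "c2"          # stand-in for years of history

# New root commit: no parents, same root tree as the current tip.
tree=$(git rev-parse "HEAD^{tree}")
root=$(git commit-tree -m "Start of new shard" "$tree")
git update-ref refs/heads/shard "$root"

git rev-list --count shard                    # prints: 1 (history starts over)
git diff --stat HEAD shard                    # prints nothing: same checkout
```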
In this way, you can still have that full-history view when you need it, but this isn't a very efficient way to operate, so I don't recommend keeping it this way all the time; it's there when you need it, and as you move forward in the new repository, you'll need it less and less often. The next strategy isn't actually a sharding strategy, but it comes from the idea of data offloading. Databases can offload data that's infrequently used to cheaper storage, and then keep the fast and expensive storage focused on the important pieces.
So if we want to think about this in the Git world, we can think about our object graph. Commits are super cheap to store, and they're really important to many Git commands, so we're going to treat them as critical. Furthermore, their root trees are going to delta-compress really well, and they're also very frequently used by Git commands. So let's consider all commits and root trees as important, critical data.
We can take a copy of the full repository, put it on a read-only network share, and use it as a Git alternate, and have our local repositories, with our expensive local disk, focus only on this critical new data. I think this is a really cool idea; I don't know if anyone has ever done it, so if you're looking for a project, give it a try.
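[Editor's note: the alternates mechanism mentioned here can be sketched locally. A local directory stands in for the network share; the repository names are invented.]

```shell
# Offloading: borrow old objects from a shared archive via alternates.
set -e
work=$(mktemp -d) && cd "$work"
git init -q archive && cd archive
git config user.email you@example.com && git config user.name You
echo history > f && git add f && git commit -qm "old history"
old=$(git rev-parse HEAD)

cd .. && git init -q local && cd local
# Point this repository at the archive's object store (the "cold" tier):
echo "$work/archive/.git/objects" > .git/objects/info/alternates
git cat-file -t "$old"            # prints: commit -- served from the archive
```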
The first is that we've talked about sharding strategies, but the only blessed sharding strategy, in terms of Git features right now, is submodules. Independent multi-repos aren't really supported, because you're going outside the boundaries of Git at that point. But these time-based shards and data offloading are something that maybe we could consider adding to Git as a feature. Maybe we can make a magic button to create these types of shards, or third-party tools could probably create these shards without even modifying the Git client, so there are lots of possible directions here.
The second idea is that I talked about pack files a lot, and how the same format is used for network transfer and for our on-disk format. Now, these pack files are immutable once they're written. Taylor was talking a lot about full repacking being very expensive, and that's because we can't just add a few objects to a pack and move on; we need to essentially repack all of our objects into a new pack and then delete the old ones, which gets really expensive when you have a very large repo.
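[Editor's note: a hedged sketch of one mitigation Git offers for this expense, keeping several packs and indexing them together with a multi-pack-index rather than rewriting one giant pack. This goes slightly beyond what the transcript states; the commands assume a reasonably recent Git client.]

```shell
# Incremental repacks create several packs; a multi-pack-index spans them.
set -e
work=$(mktemp -d) && cd "$work"
git init -q repo && cd repo
git config user.email you@example.com && git config user.name You
echo one > f && git add f && git commit -qm "c1"
git repack -dq                            # pack 1
echo two > f && git commit -qam "c2"
git repack -dq                            # pack 2; pack 1 is left untouched
git multi-pack-index write                # one index over both packs
ls .git/objects/pack/multi-pack-index
```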
So I have limited time today, but I went super deep on all these concepts in a five-part blog series on the GitHub engineering blog. I hope this has inspired you to go take a look, or at least to use it as a reference in the future. And finally, I want to leave you with this. If you didn't learn anything else from this entire talk, make sure it's this one concept: Git is the distributed database at the core of your engineering system.