Description
We'll discuss our experience running a very large Git repository with many projects and contributors, weigh the pros and cons of large repositories, introduce some enhancements we are testing to improve fetch performance through journaled replication, and cover some other optimizations we are pursuing.
Without further ado, I'm going to invite our next speaker up. He works at Twitter; it's about scaling Git at Twitter. Some of you have heard of Twitter? Is there anybody who hasn't heard of Twitter? Okay, it's sort of a social app. It's not big in France, but it's quite big in the United States, so he's going to tell you all about how they scale Git at Twitter. So please welcome Wilhelm Bierbaum.
Yeah, I have worked previously on the front-end systems and the traffic management systems, and now I work on Git at Twitter. We've basically decided that it's really important to make Twitter a good place to work for developers, and source control was one of those areas where we were kind of lacking.
So Twitter now runs development of all of its services out of a single monolithic repository. It's one huge repository that used to be three somewhat large repositories, for better or worse. Working in a single repository is the way that people prefer to develop software when faced with developing hundreds of services and dealing with thousands of build targets. In addition to being in a single repository, at this point we have a single build system that helps us build everything consistently.
For instance, why would you put everything in one repository? Well, one thing is that it's visible: it's easier for people to find code if they're looking in one repository. While code search solutions that can target more than one repository obviously exist, especially on GitHub and these kinds of things for enterprises, they're not as fluent. So instead of asking which repository the code I'm looking for is in, it's just simpler to have it all in one place.
In addition, we run a single toolchain, so there's a single set of tools to build, test, deploy and operate the services that are developed in the repository. When people make improvements to these tools, everyone benefits, because everything's on a single toolchain. We also rely quite a bit on IDL, interface definition language.
It's also easier to understand the impact and scope of changes that you make, so using a single build tool and repository has benefits for the planning aspect of coding. Since everything can be compiled together, it's easy to make a change and see what breaks. Rather than having to submit changes to multiple repositories, build those repositories, and change the dependencies among them, we can just edit files, land changes that might affect multiple systems all within a single commit, and then run the tests and see what happens.
There are many objects required for the complete representation of an entire repository's history, and those occupy considerable storage resources. Great numbers of them can cause normal operations, like git status, to perform quite slowly. Tuning the file system only goes so far, and that's why people end up trimming things out of history and possibly partitioning their repositories.
One example of partitioning a large repository that many people might be familiar with is what's done in the Android project. The maintainers of the Android project have chosen to partition their build tree into many smaller repositories, and then they use an external tool called repo to synchronize the projects that are at the top level. So there may be, you know, Android at the top level and the code that's part of that.
The reality is that a lot of developers don't feel very comfortable with submodules, despite the fact that submodules have really improved a lot over the course of the history of Git's development. The commands that you use in everyday Git, such as add or commit, don't recognize submodule boundaries very well. So if you make changes in a submodule and you're at the top level, you won't be able to actually add and make those commits in parallel.
So I'm going to talk about how we use Git at Twitter for a minute. We use Git in a very centralized way, and unfortunately, and I think this is the case for a lot of people, certainly people who use GitHub, we don't really benefit from the fact that it's a distributed version control system, beyond the fact that it's good at managing and merging patches.
For development of the services, we try to do it as close to master as possible, and we discourage maintaining long-running branches. The goal is to have as close as possible a view of the entire system in a single version. Having everyone work against master limits the amount of coordination that's necessary between developers to make changes, and has a secondary effect of minimizing the number of conflicts developers have to resolve when they merge their changes.
Our topology is that we have a lot of read-only replicas that mirror changes from a highly available read/write server. As developers push changes, those changes are written through to the read-only cluster. In any given Git installation, it's likely that there are a lot of applications, for instance continuous integration and tooling, that read data more than they write it, and so scaling out our read-only cluster has helped us meet the demand.
In the context of our read-heavy workload, we use reference repositories extensively when we're doing parallel test runs and these kinds of things, where you have many copies of a repository that need to be present on a machine, but, you know, they don't actually necessarily need an entire clone each. Reference repositories, if you're not familiar with them, allow you to have a shared object backing store with a separate working copy, a separate log, and all these things.
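As a minimal sketch of how test tooling might set this up (the paths and URL here are hypothetical, not Twitter's): a shared bare clone serves as the object store on each machine, and every worker clone borrows objects from it via `git clone --reference`.

```python
import subprocess

# Hypothetical paths: a shared, pre-fetched object store on the test
# machine, and the upstream URL it mirrors.
CACHE = "/var/cache/git/source.git"
UPSTREAM = "https://git.example.com/source.git"

def make_worker_clone(workdir: str) -> None:
    """Create a cheap working copy that borrows objects from CACHE.

    `git clone --reference` records the cache in
    .git/objects/info/alternates, so the new clone shares the object
    backing store instead of copying the whole history.
    """
    subprocess.run(
        ["git", "clone", "--reference", CACHE, UPSTREAM, workdir],
        check=True,
    )
```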
So if there are a lot of changed branches and you haven't fetched for quite some time, you might run into some locking problems, because there are pretty low file descriptor limits on certain operating systems; Mac OS X, in particular, runs into these problems in its default configuration. When you change references, you actually have to take a lock file for each reference, and you have to take a lock file, possibly, for the packed refs, and so on.
So when you have a lot of these, you might run into the file descriptor limit and transactions might fail, which is not great. Common commands like status also sometimes take quite a while in the presence of many objects and references, regardless of whether the repository is well packed.
We're experimenting with several changes to improve the performance of these repositories. Our goal is to make the performance of fetch, push, status, commit and branch faster specifically, since these are the most commonly used commands. To improve status performance: some people might be familiar with file alteration monitors. Facebook has actually put this into action in their Mercurial implementation; they have developed a file alteration monitor called Watchman that Mercurial consults to see which files have changed since the last time it asked, or there are some other semantics, but largely this is the case.
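A rough sketch of that model, using Facebook's pywatchman bindings (this assumes the bindings are driven this way, and the repository root is made up; it is illustrative only, not how Mercurial's integration is wired up):

```python
import pywatchman

root = "/home/dev/source"  # hypothetical repository root

client = pywatchman.client()
# Ask the watchman daemon to watch the tree (it dedupes repeat calls).
client.query("watch-project", root)

# Record the daemon's logical clock now; later queries with "since"
# return only files changed after this point, so the caller never has
# to stat() the whole tree itself.
clock = client.query("clock", root)["clock"]

result = client.query("query", root, {
    "since": clock,
    "fields": ["name"],
})
changed = result.get("files", [])
```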
So what are we doing in git status? One of the most expensive things is making the system calls to stat files, to see which ones have changed since last time, and Watchman alleviates that by pulling all of that into a user-land process. This has a pretty pronounced effect on Mac OS X, which pretty much every one of our developers uses, since HFS+ has pretty poor performance compared to Linux and ext4 when the default kernel configuration is used.
We also have a new index format. The index tracks the state of files in Git, and we've adopted a faster hashing algorithm that uses native instructions, so it allows our index to perform a little bit better. So, as I mentioned before, when you connect to the server, it sends you all these things that you probably already have: as soon as clients connect, they receive the list of references.
In the context of large repositories with many references, that list can be huge; in our case, it's about 13 megabytes of raw data. All this data has to be sent each time, despite the fact that only a small fraction of branches and tags might have changed between fetches. To address sending this huge piece of data, we've started experimenting with having clients send a bloom filter representing the present state of their references.
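The idea, sketched below, is that a few kilobytes of filter can stand in for megabytes of ref advertisement. This is just the concept, not Twitter's actual wire format:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over ref states (concept only)."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        # Derive several bit positions from salted SHA-1 digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha1(bytes([salt]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Client side: insert every local ref as b"refname@sha".
# Server side: advertise only refs whose current "refname@sha" is NOT
# in the filter. A false positive hides a changed ref until the next
# full advertisement, so the filter must be sized to keep that rate tiny.
```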
So this means that when you're actually fetching, it can take minutes to negotiate which data you actually want, and then there's the transfer time on top of that to get the data. Bitmap indices help in this area, but they're pretty computationally expensive for us to keep up to date, especially when they need to be repacked, and we have a central deployment with pretty significant requirements to have the same data on multiple machines.
After the pack is appended, we also append a record of which reference was modified, so you have the entire data necessary to extract all the changes and to know which refs changed, and these are saved to something called the extents file. So there are two files: there's the extents file and there's the journal file.
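Something like the following captures the relationship between the two files. The record layout here is purely illustrative; the real on-disk format is Twitter's own:

```python
import struct

# journal file: raw pack data from each push, appended in order.
# extents file: one fixed-size record per ref update, pointing into
#               the journal.
EXTENT = struct.Struct(">QQ40s64s")  # journal offset, pack length,
                                     # new SHA-1 (hex), ref name (padded)

def append_push(journal, extents, pack: bytes, ref: str, sha: str) -> None:
    """Append one push: pack bytes to the journal, then its record."""
    offset = journal.seek(0, 2)      # append-only: write at the end
    journal.write(pack)
    extents.write(EXTENT.pack(offset, len(pack),
                              sha.encode("ascii"), ref.encode("ascii")))
```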
If the clients are configured to use these journals, when they connect to the server, they attempt to retrieve the extents file. All the data in these files is append-only, so this synchronization can actually be achieved purely by requesting the bytes beyond the length of the current extents file that the client already downloaded or was provisioned with. I'll talk a little bit more about the client provisioning later. So after they receive that data in the extents file, they know which references and which parts of the journals have changed.
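That append-only property makes the client's sync step almost trivial; a minimal sketch, with a hypothetical URL and local path, might look like this:

```python
import os
import urllib.request

def sync_extents(url: str, local_path: str) -> bytes:
    """Fetch only the extents bytes we don't have yet.

    Because the file is append-only, a single HTTP Range request from
    our current file length to the end is a complete sync. (Error
    handling is omitted; a 416 response would just mean nothing new.)
    """
    have = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    req = urllib.request.Request(url, headers={"Range": f"bytes={have}-"})
    with urllib.request.urlopen(req) as resp:  # expect 206 Partial Content
        new_bytes = resp.read()
    with open(local_path, "ab") as f:
        f.write(new_bytes)
    return new_bytes
```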
They try to retrieve the journal or journals, and then they extract data from them. The way that they extract data is what we call replaying the transactions: the packs and their indices are extracted from the journal, and the refs are updated in batch through ref update, after they're pre-processed to reduce the unnecessary number of transitions. For instance, mutations of a branch that is later deleted turn into a deletion only. After this, they basically write down how far into the extents log they've gotten.
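The pre-processing step amounts to keeping only the last state of each ref; here's a sketch of that reduction, with `git update-ref --stdin` as one plausible way to apply the batch (the helper is hypothetical):

```python
def collapse_ref_updates(transactions):
    """Reduce a replayed sequence of ref updates to final states only.

    `transactions` is an iterable of (refname, new_sha) pairs in journal
    order, with new_sha of None meaning a deletion. Later entries win,
    so intermediate mutations of a branch that is ultimately deleted
    collapse into a single delete, as described above.
    """
    final = {}
    for ref, sha in transactions:
        final[ref] = sha
    return final

# The collapsed map can then be applied in one batch, e.g. by feeding
# "update <ref> <sha>" / "delete <ref>" lines to `git update-ref --stdin`.
```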
So they can remember, next time, where to start in the extents log. And to appease the garbage collector and to prevent the proliferation of really tiny pack files representing each individual transaction, we invoke another process, like a repack, that runs in the background and combines all the small pack files into a larger one. This kind of doesn't play well with the heuristics of pack files, but in practice people end up doing full repacks often enough that this isn't much of a problem.
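Stock Git can already do the consolidation part; a sketch of what that background step might boil down to (the wrapper is hypothetical, the repack flags are standard):

```python
import subprocess

def consolidate_packs(repo_path: str) -> None:
    """Fold many tiny replay packs into one larger pack.

    `git repack -a -d` rewrites all reachable objects into a single
    pack and drops the now-redundant small ones, which is roughly what
    the background process described above needs.
    """
    subprocess.run(["git", "-C", repo_path, "repack", "-a", "-d", "-q"],
                   check=True)
```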
So the fact that this is log-structured gives us a few operational benefits. It's really cheap to serve thousands of clients off one machine, and the data they read is pretty much always going to be in the file system cache. So all we're doing to replay, to give them the data they want, is sending files, sending bytes over the wire. And since all the files are essentially static data on the disk, there's the possibility of introducing intermediate HTTP caches that understand range requests between the client and the server.
So we can set up somewhat dumb proxies in far-off places that yield much better performance for the people that we're trying to serve changes to. Basically, without these journaled fetches, our repository won't fetch well at all, and this gives us a much more predictable runtime and is often faster than git fetch.
Instead of cloning with a regular git clone, we have people fetch initial snapshots through BitTorrent, so that it doesn't take so long to actually extract all this stuff. For some users, fetching and replaying all the changes might be overkill, so you can use this on a per-ref basis. You might have this set up so that master always gets journals, and then the rest of the branches are fetched from a differently configured remote where the journals aren't necessary.
That way you don't end up transferring everybody's branches along with, you know, the code that you're actually going to share. And finally, and this is the most important thing that people point out, it doesn't support redaction. So if people push stuff that shouldn't be there, like private keys and such, we don't have any way to delete it. But realistically, in Git, people could have already read those objects and have them on their local machine, and there's no way to actually enforce that objects are deleted on clients.
We're not sure this stuff is right for everybody. We've never really wanted to maintain a fork; we kind of do right now, but if any of these optimizations appeal to people, and I think that possibly the ref mutation optimizations might, we'd love to work on upstreaming those. By design, we've tried to make all this stuff have a pretty small footprint in the codebase.
B
We're
modifying
existing
programs
is
concerned
and
integrating
these
changes
should
be
pretty
easy.
All
these
optimal.
These
optimizations
are
totally
optional
and
their
can
configure
through
our
control
through
configuration
and
with
that
anybody
has
any
questions.
I
don't
have
that
much
time
left.
But
if
you
have
questions-
and
you
want
to
talk
to
me
afterwards-
that'll
be
fine.
Does any of it make sense to upstream?

I think that, specifically, the thing where we send a bloom filter that indicates which branches need to come down is a huge optimization that could be upstreamed. The file alteration monitor is a change that we tried to upstream, but other people were working on that at the same time, and in Git there's a huge tendency to not want to take dependencies outside of Git itself, which is fine, but yeah.
It would make sense to upstream a few of these changes. The journal isn't for everybody; it's great for situations where you have a huge repository and everybody's getting the same thing all the time, which is the case that we had, and it's also great for mirroring: it's much faster than using the regular fetch mechanism for mirroring. But it might not make as much sense for everybody.
I'd imagine that it could be separated out and sent as a contribution, or as a separate extension to Git, such as what happens with git-annex and these kinds of things, but it might never have a place in Git proper. It's still possible to achieve these things within the scope of, you know, an extension; there aren't significant modifications necessary to make this happen. The only real modifications are to the tools that you use to fetch and pull, for instance. Cool.