From YouTube: The Challenge of Monorepos: Strategies from git-core and Open Source, John Garcia - Git Merge 2015
Description
The problem of monorepos, and the related large-file and locking issues, is the Achilles heel of the Git DVCS. We lay down the specifications for our imagined ideal large-file extension for Git. We'll talk about locking and transmission strategies, as well as our current work in the area.
And the next person that I'd like to introduce works for a great company. GitHub and Atlassian, I think, are sort of, you know... people always ask me how we get along with Atlassian, and they're the best. We actually did a karaoke contest not too long ago in San Francisco. Unfortunately, GitHub lost; it may have been because I was on the team. If you'd like to see a demonstration of those skills later on tonight, you probably couldn't stop me from doing it, so it may happen. So, here he is.
All right, thanks for coming out today. I hope everyone's enjoying Paris; this is a beautiful city. Hopefully you have time to get around and see some of the great sights. I took in the Eiffel Tower when I first got here, and that was outstanding. But I'd like to begin by introducing my talk: I'd like to talk about, as many people here have, the challenge of monorepos, or dealing with large objects in Git.
This is a problem that many of us face. I'd like to introduce myself: I'm a developer philosopher, which means that I look at what the humanities and the arts have to say about development and try to synthesize that into a philosophy about how software can best be developed. I got my start at a very young age with this machine.
If you don't recognize it, this is the TI-99/4A. This is one of the first machines you could just throw on your living room floor, plug into a regular television, and do whatever you wanted with. I learned TI BASIC on it. Fast forward 20 years, and now I'm working at Atlassian on Bitbucket. We are a large DVCS provider, and a lot of folks use us as the source-of-truth repo for their team.
We have a lot of products that work in sync with Git and enable teams, both large and small, to use Git to the best of its capabilities. So I'd like to talk about why we love Git, and that has a lot to do with the Git feature set. It's important to look at data integrity as a large part of that feature set because, as each file is handled by Git, it's checksummed into a hash, and that allows us to efficiently make good choices about how we store the file.
It has an advanced branch model, which allows distributed work among teams who can then merge their work later without the necessity of scheduling that work. And it has file splaying and chunking capabilities to enhance filesystem performance and to prevent storing too many files in the same folder.
So, getting a little bit more into depth about these topics. Data integrity: when Git operates on a file, it makes a checksum of the file, and if it recognizes that this checksum is identical to the checksum of another file it's already storing, it knows it only has to store that file once.
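You can see this content addressing directly with git hash-object; a minimal sketch (the file names here are just illustrative):

    $ echo 'hello' > a.txt
    $ cp a.txt b.txt
    $ git hash-object a.txt      # checksum of the file's content
    ce013625030ba8dba906f756967f9e9ca394464a
    $ git hash-object b.txt      # identical content, identical hash: stored once
    ce013625030ba8dba906f756967f9e9ca394464a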
Another thing we love is the branch model. The branch model enables work to be done specifically on one branch and to be compartmentalized, and it generally allows the end user to control the amount of data that they share with the server, thus restricting bandwidth use and making good choices with your resources; we can ignore other branches as needed.
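For instance, pushing just the branch you're working on shares only that branch's new objects with the server (the branch name here is illustrative):

    $ git push origin feature/locking-design   # other local branches stay local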
I'd also like to talk about file splaying. File splaying is the process of storing the files in the Git repo in folders that start with the prefix of the hash, so that we don't end up with a folder with a million files in it; we can have 256 folders, each with a subset of the repository.
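Git's own loose-object store works exactly this way: the first two hex characters of the hash (256 possibilities) become the folder name. A quick look:

    $ git hash-object -w a.txt
    ce013625030ba8dba906f756967f9e9ca394464a
    $ find .git/objects -type f
    .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a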
This is a work of art made by humans; it's huge, and it has no place in your Git repository, but sometimes that's exactly what you end up with. So we looked at the folks that are reporting these issues to us and took a look at the taxonomy of what the objects look like. For us, generally, our problems come with package repositories, such as Bower, or with archives.
Other use cases will have problems with either audio formats or video formats that take up a large amount of space, but then there are also scientific computing needs, such as MATLAB and Simulink and the like, and large database setups, like MongoDB, that use huge amounts of structured data.
So over time, I did some research in our support database, and over the course of two calendar years we saw 5.2 issues per week related to large repositories; that's more than one every business day. Quite literally every day, somebody is calling us to talk about their large repo, and they're sad. They're very sad, and they say things that make us sad sometimes, namely that they're going to leave Git.
The idea is that we can store the text artifacts in the Git repository but store the binary artifacts in an off-site storage solution, such as S3, or possibly even on a local drive if that's appropriate. We've been doing a lot of research, and we've given a lot of thought to this problem, and I want to show you what we think about it, and maybe a little bit of a proof of concept at the end. But I want to be really clear that we don't intend to prescribe where this conversation goes.
We really just want to put the results of our research out for the community to see, to hopefully help the community find the best solution, because, and I think everyone here can agree, it's really best for the basic, fundamental tools that we use to be tools that we can all use and we can all contribute to. So in our research we identified a few potential areas of improvement, the first of which is cross-platform support.
We feel that any solution to the large-binary problem should be platform agnostic, to the degree that it's able to work with POSIX and non-POSIX systems, and should also work with a variety of backends: your S3, your local storage, or, if you want, something like a FUSE filesystem. We feel that that sort of interoperability is really crucial.
Finally, we think that a complete solution would probably address the question of file locking, and of course that's a very difficult question. It's certainly a contentious topic in the Git community, and really in distributed version control thinking at large, because locking doesn't fit so well with a distributed model.
So, on performance: the observation that came up for us was that existing solutions use smudge and clean filters when they operate on files, and this is very much like what Rick was talking about a moment ago. When you make a checkout, you download the repository, and if you're using an external storage solution, you'll generally have files that hold references to the external objects that you need for your repository; on checkout, the program transforms those into the actual artifacts. On commit, the opposite is done: the artifact is distilled into a checksum that's put into a text file. What can happen is that if you have a huge number of objects and you have to iterate over each of those objects in sequence, that's a lot of serial operations, and even the slightest overhead in starting your process will add up over time.
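Wiring up such a filter is just Git configuration; a minimal sketch, where "largefiles" and the two helper commands are hypothetical names standing in for whatever an extension would ship ("%f" is Git's placeholder for the file path):

    $ git config filter.largefiles.clean  'largefiles-clean %f'    # hypothetical helper
    $ git config filter.largefiles.smudge 'largefiles-smudge %f'   # hypothetical helper
    $ echo '*.mp4 filter=largefiles' >> .gitattributes

Note that Git launches the filter command once per file, which is exactly where the per-process startup overhead described above accumulates.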
Let's talk about portability as well. We feel it's critical that a great solution will work with all types of computers, but also with all types of backends: it'll work with cloud storage and local storage, and ideally it would even work with Samba or any sort of storage that you have. And finally, much as Rick said, we find that a package that's easy to distribute, one that can be compiled into a static binary rather than dealing with runtime dependencies or some sort of installer, makes a much more compelling case for uptake. It makes it a lot easier for people to package it and for users to download and install it, and generally it's a much better experience all around.
So I want to talk about file locking, because sometimes you just can't merge. As we can see here, there are file types where, of course, everything can be merged, but where, if human reviewers must review the changes, the cost of merging is prohibitive.
So it's important that we at least consider what we might do to help proactively prevent merge conflicts for certain file types. We've looked at what kinds of files this happens with, and it's the usual culprits: the large binaries, such as your audio files and your video files. But in addition, some of the larger expressions of Markdown or XML, or even files auto-generated by Unity and SolidWorks, can be very difficult to merge as well.
For starters, you download your repo, and when you make your clone, the file types that are controlled in this way would be locked for modification. Because of the process they've been put through, these files have, say, a serial commit graph, so that even though each commit may come from a different branch, each commit is an atomic expression of the new file.
So if you need to check out this file, or you need to unlock the file, we would expect that you would download the newest copy of the file; when you do, we release the lock to you and allow you to make changes. Once you've made changes and you're ready to commit, you commit that to the newest position in the repo, and it becomes the new position of the file.
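As a sketch of what that flow might look like on the command line (the "git lock" command and its subcommands are invented here purely for illustration, not an existing tool):

    $ git lock acquire assets/intro.mp4    # hypothetical: fetch newest copy, take the lock
    ... edit the file ...
    $ git commit -am 'Trim the intro video'
    $ git push                             # the commit becomes the file's new tip
    $ git lock release assets/intro.mp4    # hypothetical: hand the lock back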
Of course, any solution like that, one that's prescriptive, should follow the regular Git rules, and the regular Git rules include a functional force option to allow users to make sensible decisions when there are unexpected circumstances. So, for instance, if somebody locks a file and then goes on vacation, or leaves the organisation, users need to be able to say: we understand that there is a cost involved in allowing a concurrent modification; however, we are willing to bear that cost. So I want to talk a little bit more about expanding the object model as well.
Another important part of the model that we consider is local object retention, which is to say that it's important to consider which files you would want to keep and which files you would not, so that you don't fill your hard drive entirely with files. This is really a local process that I'm going to describe, although the brave at heart may apply something like it on the server end as well. We break the commit history up into three distinct bands: a near-term band, a midterm band and a long-term band.
In the midterm, we want to make sure that we keep the files on the branches that we're using; files that belong to a different branch, we may not be concerned with. And files beyond 90 days, we're pretty sure nobody is concerned with. So this kind of provides a paradigm for garbage collection, or for reclaiming local storage.
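Expressed as configuration, the bands might look something like this (a hypothetical sketch; none of these keys are a real Git or extension feature):

    [largefiles]
        # near term: keep everything reachable from checked-out heads
        retain-checked-out = true
        # midterm: keep objects on branches touched in the last 30 days
        retain-branch-days = 30
        # long term: anything older than 90 days is eligible for local GC
        retain-max-days = 90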
So I'd like to show you a proof of concept that we've been working on in our lab and give you an idea of what our thinking is on the subject. Here I am creating the large object store on my Raspberry Pi using SSHFS, which is a FUSE implementation; creating the repository in Bitbucket; initializing the repository on my local machine; and then adding and committing my first files. I used my Raspberry Pi to point out that this works really elegantly with any sort of back-end storage; that was really one of our core design features.
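The setup steps amount to roughly the following (the host name, paths and file names are illustrative):

    $ sshfs pi@raspberrypi:/srv/large-objects ~/lob-store   # FUSE mount of the Pi
    $ git init presentation && cd presentation
    $ git add slides.key intro.mp4
    $ git commit -m 'First cut of the deck'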
We want it to not be locked into anything: any sort of service, any sort of server. We want to keep it as simple as possible. So here I'm adding the parameters to my .git/config, and I'm going to do my initial push of large objects. This will take a while, but I fast-forwarded it for your viewing pleasure.
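Those parameters would be something along these lines, in the same hypothetical vein as the retention settings above (the section and keys are invented for illustration):

    [largefiles]
        store = sshfs:///home/pi/large-objects   # any mounted path or backend URL
        track = *.mp4 *.key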
So we see there are 66 megabytes to upload, and that's now done; we can see there's now a file on the Pi. Then we copy a new file into the project and commit that new file, and this uses the clean filter to turn the file into an actual repo item. And now, when we push, we see that we're pushing up two commits, but we start at 51%, because we've already pushed the first file that belongs to this commit.
Now we've pushed the binaries to the origin, and it's time for me to open the file and do some work. So here's a previous version of this presentation that we're watching right now, and it came to me that this would be a great slide for the demo, rather than a quote. So I'm going to make a new branch for the video, commit that, and then also store the artifacts.
So now that that's pushed to the origin, I'm going to make a new clone of the repo so that I can simulate what it would be like if my coworker Nicola were to work on this. So he pulls it up, adds in his configuration, saves the modified files and pulls the binaries, and of course this takes some time.
He's got to download the entire set of binaries, because he does not yet have the objects that I've stored, but once he's downloaded those he's ready to start working. So he'll open the file... oh wait, no, that's the wrong one! That's the wrong branch! Let's change branches. Notice that changing branches took just a moment.
So now we're on the new branch. We want to put a placeholder in for where the video scene will go; then we'll save it, commit it, and push the repo up to the origin and the large objects up to their storage. You'll notice again this starts at 75%, because three out of the four objects are already found at the storage location.
All right, now that that's complete, I want to look at the large object store and see what's on my Pi. It looks like I have four folders now, one for each of the objects. The files in the store are chunked, which is to say that they are broken into smaller pieces, so that if a part of the file doesn't actually make it, that chunk can be uploaded separately in the event that there is some sort of connection interruption.
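Chunking itself is simple to picture; a minimal sketch with a fixed 4 MB chunk size (real implementations typically name each chunk after its own hash so uploads can be resumed and deduplicated):

    $ split -b 4m intro.mp4 intro.mp4.chunk-
    $ sha256sum intro.mp4.chunk-* > intro.mp4.manifest   # verify, and re-send only missing chunks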