From YouTube: Data Together Presentation at PASIG Autumn 2017
Description
Presentation about applying decentralized patterns to the activities of Libraries, Archives and Museums. Given at PASIG, a conference of digital preservation specialists. There is a less technical version of this talk, which goes deeper into the economic context, in the recordings from the 2017 NDSR symposium at the World Bank: https://archive.org/details/ndsr-dc-2017
Hi, this is Matt Zumwalt, and this is a recording of the presentation I gave at PASIG at Oxford University in autumn 2017. This was a meeting of digital preservation specialists from libraries and archives around the world. I work at Protocol Labs, which is a research, development and deployment lab for networks. Our main projects are IPFS, the InterPlanetary File System, and Filecoin, which you may have heard about in the news recently.
It shows a client-server architecture where, if you imagine each of the dots in each of these diagrams as being a machine or a node on the network, in the client-server structure you have one server that serves many clients, and those clients are unable to connect to each other; they are only able to connect to the server.

In the middle diagram, which this paper called decentralized, you have a federated architecture, where some machines are able to connect with each other, but the other client nodes are only able to connect to those supernodes; they are not able to connect directly with each other. Then, over on the right, you have a peer-to-peer architecture, where any node is able to connect with any other node, and is also able to rely on other nodes as relays when passing information between itself and other nodes on the network.
A
Instead,
it's
forced
to
exist
in
some
pre-configured
system,
that's
controlled
by
other
parties
and
as
as
my
work
progressed
in
helping
libraries
and
archives
to
build
their
own
systems,
I
became
convinced
that
the
systems
we
were
creating
was
still
we're
still
participating
in
that
same
system.
Despite
the
best
intentions,
we
were
just
building
another
form
of
cage,
rather
than
helping
the
patrons
of
those
libraries
and
Archives
to
thrive
with
their
data
existing
in
the
wild
and
so
I
became
fairly
strongly
convinced
that
the
solution
is
to
rebuild
the
web.
Now let's look at an example. In response to the election in 2016, people became concerned that the new federal administration would start making climate data less available, either by turning off servers or obstructing availability in some other way, and so there was an upswelling of effort of people attempting to move that data to other, safer locations. From a preservation perspective, it's important to think about what this means. This was thousands of people spending days, weeks or months of their lives attempting to preserve other people's data.
Now, what happened in this effort was that there were some unintended consequences. The thing to keep in mind is that these people were attempting to help those federal agencies. These are people who care about the EPA, and they care about the data produced by the EPA, but when they made replicas of the data, that data became something that was in competition with the original.
By switching to this alternative pattern on the web, it brings up a new kind of conversation. Rather than a competition to define which locations we should rely on to retrieve content, it becomes a conversation about what data are at risk, what are the ways they're at risk, why those data are valuable, who they are valuable to, and who should be holding copies of them.
What this does, in an underlying way, is turn access, discovery and preservation into participatory activities; they become conversations between people who care about data and the institutions and community organizations that represent them or aggregate resources from those communities. So if we express this as a simple principle, it's that information should be possessed by the people who rely on it, and that communities and institutions should aggregate information and resources to support access, discovery and preservation.
Now, from a libraries-and-archives perspective, there shouldn't be anything revolutionary about these ideas; they're simple and straightforward, and that's because libraries and archives were created as a means of sharing possession of information resources. But let's look at how the current structure of the web stands as an obstacle to achieving this. The current structure of the web uses an approach of location addressing, where we identify content by its location, rather than content addressing, where you identify information by its content.
For a simple example, think about how we identify a book. When I recommend a book to you, I recommend it based on its content: you should read the book with this title, by this author. I might tell you when it was published or who the publisher was, and you can go and find any copy of that book that matches that description of its content, and read it, confident that you're reading the book I recommended.
Now, if we instead use links to locations, each copy of the content becomes a distinct thing in a distinct location, with its own identifier, which has its own profile of who relies on it and points to it. That's where you've created this situation where, even if you were replicating the content in an attempt to reinforce availability, you're actually undermining the original copy, because you're competing with it to be the location.
Now, the alternative to this is to use a content-addressed approach with digital information. The key benefit of content addressing is that, if anyone on the network has a copy of the content, you'll be able to find it and retrieve it. The key idea here is that location doesn't matter; what matters is that you are able to get the content that you requested, and to be confident that you are retrieving the content that you wanted. The way we achieve this is by using cryptographic hashing.

Now, for a digital preservation community, this should be a familiar concept, because we use this pattern with fixity checking: you can put any content through a cryptographic hashing algorithm, and that algorithm generates a unique string of letters and numbers that represents the content. If you put exactly the same content through that algorithm, you will always get the same hash.
You will always get the same string of letters and numbers as a result; but if you put in content that has changed, even a single bit within that content, you will get a different hash. So what this means is that those hashes are universally unique identifiers for precisely that content. With a content-addressed approach, we use those hashes as the identifier for the content, and what this achieves is that benefit: if anyone on the network has a copy of that content, you can find and retrieve it.
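As a minimal illustration of that fixity property, here is a sketch in Python using the standard hashlib library (the sample bytes and the choice of SHA-256 are ours, purely for demonstration):

```python
import hashlib

original = b"Global mean temperature anomalies, 1880-2016"
tampered = b"Global mean temperature anomalies, 1880-2017"

# The same bytes always produce the same hash.
assert hashlib.sha256(original).hexdigest() == hashlib.sha256(original).hexdigest()

# Changing even one character produces a completely different hash,
# so any alteration is detectable.
print(hashlib.sha256(original).hexdigest())
print(hashlib.sha256(tampered).hexdigest())
```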
Let's look at how this affects patterns like accession of data. If I have content that I've been sharing with my peers using a content-addressed approach, at any point I can share that identifier with a library, and the library can accession that content. I would never have to upload the content; all I have to do is pass the identifier to the library, and they can retrieve the content from the web on their own.
But you can do this in a more robust way, where I might also submit metadata about my content, such as a version history. So you could have a series of versions of a dataset that I think are important for the library to hold in their collections; all I need to do is submit the hash of that versioned history of the content (so I'm submitting the hash of the metadata), and the library can pull in the whole corpus of information, including the metadata, as they see fit.
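A sketch of what that accession flow could look like, assuming a hypothetical pin_hash helper on the library's side that stands in for "retrieve this hash from the network and keep a verifiable local replica" (all names here are invented for illustration):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AccessionRequest:
    # The depositor submits identifiers only, never the bytes themselves.
    content_hash: str   # hash of the dataset
    metadata_hash: str  # hash of its version history / metadata

def accession(request: AccessionRequest,
              pin_hash: Callable[[str], None]) -> None:
    """Library-side accession: fetch and hold both corpora by hash."""
    pin_hash(request.metadata_hash)  # pull in the version history
    pin_hash(request.content_hash)   # then the content it describes
```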
Underlying this pattern, the thing that makes it powerful is that you're using hash-linked data structures. This is a pattern whose impact we've already seen in areas of the technology industry, through technologies like git and Apache Spark, or Bitcoin and BitTorrent; underlying all of these technologies is the pattern of hash-linked data structures. So what are the benefits here? Why are we using hash-linked data structures?
Underlying that is this notion of cryptographic integrity checking. You can use the link value itself, that hash, to validate the content that you got, and so this is a powerful tool for ensuring the integrity of entire systems of data over time, and as they're transmitted over the web.
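A toy hash-linked structure in Python makes the idea concrete: every object is addressed by the hash of its bytes, and links between objects are just hashes, so holding one root hash lets you verify everything it links to. (This encoding is our own illustration, not the actual format used by git, IPFS or Bitcoin.)

```python
import hashlib
import json

store: dict[str, bytes] = {}  # toy content-addressed blob store

def put(obj: dict) -> str:
    """Store an object and return its content address (its hash)."""
    data = json.dumps(obj, sort_keys=True).encode()
    digest = hashlib.sha256(data).hexdigest()
    store[digest] = data
    return digest

def get(digest: str) -> dict:
    """Fetch by hash, verifying integrity before returning."""
    data = store[digest]
    if hashlib.sha256(data).hexdigest() != digest:
        raise ValueError("content does not match its hash")
    return json.loads(data)

# A record links to a dataset by hash rather than by location.
dataset_hash = put({"rows": [1, 2, 3]})
record_hash = put({"title": "Example dataset", "data": dataset_hash})
assert get(get(record_hash)["data"]) == {"rows": [1, 2, 3]}
```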
Now, the most fundamental thing that's making hash-linked data structures powerful is that what you're doing is creating immutable data structures. This notion of immutable data structures is prominent in the field of computer science; it's been around since the advent of programming languages. The main idea here is that you can create data structures that do not mutate, and whenever you create a new version of the data, you have a new identifier for that new version.
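In that spirit, updating data under content addressing never overwrites anything; it yields a new object with a new identifier. Reusing the toy put() store from the sketch above:

```python
v1 = put({"rows": [1, 2, 3]})
v2 = put({"rows": [1, 2, 3, 4]})       # a new version of the data...
assert v1 != v2                        # ...gets a new identifier
assert get(v1) == {"rows": [1, 2, 3]}  # the old version is untouched
```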
So if I have a file, or a whole corpus of files or data on my system, I can tell IPFS to add that content to its repository, and IPFS will return the cryptographic hash that identifies that content. I can then use that hash to request the content through any IPFS node anywhere on the network, and that node will be able to use the hash to retrieve the content, validate it and return it to me.
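With the IPFS command-line tool installed and a daemon running, that add-then-retrieve round trip looks roughly like this (a sketch; the filename is hypothetical):

```python
import subprocess

# Add a file to the local IPFS repository. The -Q flag prints only
# the resulting content identifier (the hash of the content).
cid = subprocess.run(
    ["ipfs", "add", "-Q", "dataset.csv"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print("content identifier:", cid)

# Any IPFS node can now resolve that hash, fetch the bytes from
# whoever holds them, validate them against the hash, and return them.
content = subprocess.run(
    ["ipfs", "cat", cid], capture_output=True, check=True,
).stdout
```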
IPFS nodes are also backwards compatible with the HTTP web, the regular web as we know it today: you can use these hashes to ask any IPFS node to retrieve content for you over HTTP. This allows tools like web browsers to use HTTP to retrieve content that's actually stored on the peer-to-peer web. Now, to look at a concrete example: the text is a bit small here, but this is a real hash for a snapshot of the English version of Wikipedia.
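In practice, an IPFS gateway exposes any hash at an HTTP path of the form /ipfs/&lt;hash&gt;, so an ordinary HTTP client can fetch peer-to-peer content; a sketch with Python's standard library (the identifier below is a placeholder, not the Wikipedia hash from the slide):

```python
from urllib.request import urlopen

cid = "Qm..."  # placeholder: substitute a real content identifier

# A public gateway resolves the hash on the peer-to-peer network
# and returns the content over plain HTTP.
with urlopen(f"https://ipfs.io/ipfs/{cid}") as response:
    content = response.read()
```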
What this opens up is the possibility of a notion of pinning. If I have content on my machine and you want to hold a copy of that content on your machine, what you do is tell your IPFS node to pin that hash onto that machine. What that tells the node to do is retrieve the content corresponding with that hash and hold onto it on that machine, until or unless you unpin it and run some form of garbage collection to clean up afterwards.
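That lifecycle maps onto three IPFS commands; a sketch of the sequence (the identifier is again a placeholder):

```python
import subprocess

cid = "Qm..."  # placeholder content identifier

# Pin: fetch the content for this hash and keep a local copy of it.
subprocess.run(["ipfs", "pin", "add", cid], check=True)

# Later, release it: unpin, then garbage-collect unreferenced blocks.
subprocess.run(["ipfs", "pin", "rm", cid], check=True)
subprocess.run(["ipfs", "repo", "gc"], check=True)
```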
So let's look at how this impacts accessioning, access, discovery and preservation, the activities that libraries and archives engage in all the time. The main idea here is that we can use sets of pinned hashes and treat them as collections; this gives us content-addressed, peer-to-peer collections. Since this group is focused on preservation, let's start there. The first benefit is that replication becomes easy and transparent. That, in and of itself, is extremely useful in a preservation context.
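For example, a collection could be nothing more than a shared set of hashes that every participating node pins, so each participant becomes another verifiable replica; a minimal sketch (the collection contents are invented):

```python
import subprocess

# A collection is just a named set of content identifiers.
collection = {
    "Qm...a",  # placeholder hashes; a real collection would list
    "Qm...b",  # the identifiers of the datasets it preserves
}

# Joining the collection means pinning every hash in the set.
for cid in collection:
    subprocess.run(["ipfs", "pin", "add", cid], check=True)
```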
When you're moving data between different storage devices or different storage contexts, you have the ability to replicate and then to validate the results very easily. But there's another powerful pattern here: consider a downloaded copy of the content, possibly downloaded by one of your patrons. If they downloaded it using a content-addressed approach like IPFS, that downloaded copy is a valid replica that you can cryptographically validate at any point, which opens up the door to a pattern of participatory preservation.
This is enabled by the ability to do integrity checking automatically on the content. But there are also some ways in which this benefits just the day-to-day activities of preservation, such as format migrations and versioning of content; those things become much easier when you're using a content-addressed approach.
This also impacts accessioning. The main way it impacts accession of content is that it lowers the barriers to accessioning content in the first place, and it anticipates the next generation of tools: the people who create and share data in their daily work are going to be using tools like git and Apache Spark, which are already using content-addressed approaches.
It also impacts discovery in really interesting ways. In one sense, the metadata that we're tracking about our collections is a dataset in and of itself, a dataset you could publish over the decentralized web, so that anyone could fork that collection, perform machine analysis on it or enrich its metadata, and submit the results, which could be validated and then potentially integrated into the main or official version that an institution or community maintains.
It also provides a powerful means of deduplication: deduplication of content, deduplication of effort and deduplication of metadata. So, for example, if I already have metadata for a particular piece of information, you could request the metadata and match it against the content using a hash of the content itself.
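A toy sketch of that deduplication idea: because a hash identifies exactly one piece of content, a metadata index keyed by content hash lets two holders of the same bytes share a single record (the index structure is invented for illustration):

```python
# Metadata keyed by content hash: the hash identifies the content
# exactly, so identical content shares one metadata record.
metadata_index: dict[str, dict] = {}

def describe(content_hash: str, record: dict) -> None:
    # If a record already exists for this hash, the descriptive
    # work has already been done; nothing is duplicated.
    metadata_index.setdefault(content_hash, record)

def lookup(content_hash: str) -> dict | None:
    return metadata_index.get(content_hash)
```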
If you find these ideas interesting, or you'd like to participate in the conversation about where these technologies might go, we've been using the heading Data Together as a placeholder for those conversations. A number of organizations are already involved, and we'd love to hear your voice as well. We're not only interested in seeing how we can apply this to the data of the web as we know it today; we're also interested in looking at how we can apply these patterns to the next round of data that's arriving on the internet, such as the Internet of Things and the data being produced for mixed reality, augmented reality and virtual reality, which will play such a central role in the way that people encounter information over the coming decades. I hope you found these ideas interesting and inspiring, and I look forward to hearing from you.