From YouTube: Keynote & Overview Survey - David Aronchick, Wes Floyd
Description
This talk was given at IPFS Camp 2022 in Lisbon, Portugal.
So what does that mean? Well, first I'd like to do a quick survey on exactly what we're talking about when we talk about big data. The amount of data on the web is obviously shooting up like mad. You can see here: the number of things that happen every minute on the web is enormous. This is something that Juan alluded to earlier. In Web3 and distributed science (excuse me, distributed compute), we're going to have to deal with numbers at these levels.
That's such an enormous number that it's almost impossible to believe. It's literally billions of times larger than the hard drives we have today. And you might think to yourself that you don't have a big data problem, and that may be true today, but it's actually remarkably easy to get to big data. So let's put a marker on the board and call it a hundred terabytes. How would you accumulate a hundred terabytes of data today?
A hundred terabytes is much, much larger than most of our hard drives and computers and things like that; it becomes hard enough for people to manage. Well, you could start with just a thousand nodes across your overall deployment, each producing a gigabyte of logs a day. Maybe it's a hundred VMs, each of which has ten services on it, and each of those produces 100 megabytes a day. Maybe you're doing video collection.
Or video streaming: you have uploads from 20,000 users, five videos each, five minutes long, and that'll get you to a hundred terabytes. Fleets of vehicles or edge devices: a thousand vehicles, where the average vehicle today has 70 sensors (a number that's probably going to go up over time), each producing 150 megabytes a day; that's 100 terabytes. Or maybe you're doing something really broad, collecting from millions of IoT devices all across the world, each producing only one megabyte a day, and you're going to hit it.
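As a rough back-of-the-envelope check on those scenarios (a sketch using the per-device rates quoted above; the exact figures are illustrative, not measured):

```python
TB = 1e12  # decimal terabyte, in bytes

# (scenario, number of producers, bytes produced per producer per day)
scenarios = [
    ("1,000 nodes, 1 GB of logs/day each",      1_000,     1e9),
    ("100 VMs x 10 services, 100 MB/day each",  1_000,     100e6),
    ("1,000 vehicles x 70 sensors, 150 MB/day", 70_000,    150e6),
    ("millions of IoT devices, 1 MB/day each",  1_000_000, 1e6),
]

for name, count, per_day in scenarios:
    daily = count * per_day
    print(f"{name}: {daily / TB:.1f} TB/day, "
          f"100 TB in ~{100 * TB / daily:.0f} days")
```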
These numbers are very realistic today, and each of them will get you to 100 terabytes. And you might wonder why I'm talking about 100 terabytes. Even at extremely fast bandwidth, with no interruptions, moving 100 terabytes from one place to another at 10 gigabits per second takes an entire day. Meaning, if you're generating that much on a daily basis and moving it from place to place, then with any interruption whatsoever, or any compute necessary along the way, you're going to fall behind immediately.
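That transfer-time figure is easy to verify, assuming a sustained, uncontended 10 Gbit/s link with no protocol overhead:

```python
data_bits = 100e12 * 8   # 100 TB expressed in bits
link_bps = 10e9          # 10 Gbit/s, sustained
hours = data_bits / link_bps / 3600
print(f"{hours:.1f} hours")  # ~22.2 hours: essentially a full day
```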
We put together a quick summary here on what it costs to do the storage yourself, if you're going to build your own on-prem data center. Hard drives are cheap, so how bad could it be? Well, the cost of doing this over a five-year period to store one petabyte (that's just ten of these 100-terabyte chunks) is 1.3 million dollars. And you might say, well, hard drives are cheap, maybe they're coming down, but the system hardware only makes up a small percentage of this, at five hundred thousand dollars.
The rest of it is maintenance, ongoing facilities, all that kind of stuff. So it's really going to be quite costly to store it yourself.
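Working those quoted figures through (the $1.3M total and $500k hardware share come from the talk; the derived numbers just restate them):

```python
total_5yr = 1_300_000   # quoted 5-year cost to self-host 1 PB on-prem
hardware = 500_000      # quoted hardware share; the rest is maintenance, facilities, etc.
tb_months = 1_000 * 60  # 1 PB held for 5 years

print(f"non-hardware share: {(total_5yr - hardware) / total_5yr:.0%}")  # ~62%
print(f"effective cost: ${total_5yr / tb_months:.2f} per TB-month")     # ~$21.67
```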
So I would argue that compute over data isn't just recommended; it's practically the law today. Now, you have a bunch of options for doing compute over data, for actually pushing your compute to where the data is, and they're pretty good, to say the least. These are huge platforms: hundreds of millions, billions of dollars invested, billions of dollars of market cap here.
But the truth is that they're really built for a very particular set of needs. They're not really built for our new decentralized world, or, I would argue, for data scientists. If you look at the most canonical example of doing data science or data analysis today, it will look something like this: the Pandas example of doing an analysis across a small set of houses in a particular region. Now try to do the exact same thing in Hadoop.
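The slide itself isn't reproduced here, but a minimal Pandas analysis of the kind being described might look like this (the file name and column names are hypothetical):

```python
import pandas as pd

# Hypothetical housing dataset: one row per house.
df = pd.read_csv("houses.csv")  # assumed columns: region, price, sqft

# Filter to one region and summarize price per square foot.
regional = df[df["region"] == "Lisbon"]
print((regional["price"] / regional["sqft"]).describe())
```

A handful of lines locally; the Hadoop equivalent drags in job definitions, cluster configuration, and a rewrite of the logic itself.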
The problem is that you have to take what people know and love today (Python, Jupyter, and so on) and convert it into that platform's form so it can run in this distributed way. And this isn't even the whole thing: we haven't even gotten to scheduling, orchestration, and other things like that, let alone the maintenance of your overall cluster. Now again, I don't want to dismiss those platforms; they're built for a very specific reason. But is that what's needed for everything? Let's say I just want to filter my data at the edge.
Let's say I want to do some trivial transforms, or something more profound, like building reproducible experiments. That might not be appropriate for a centralized, centrally authorized and maintained cluster.
So now you're saying: okay, you're making a case that maybe everything shouldn't be centralized in a single place, but why does it have to be truly decentralized, operating in a trustless environment? The funny part is, I was trying to come up with a good answer for this, and I actually just went to the dictionary. The definition of decentralization is: organizations whose activities are not performed in one central place but happen in many different places.
There you go; I didn't even have to say it. Going back to those hundred terabytes: you have many machines, many devices, many users, all of them spread all over the world. They're not sitting in your data center; they're all over the world. They're already decentralized. Your data is already decentralized, whether you want it to be or not, and the problem is exactly what you see in front of you. So I'm going to walk you through a kind of trivial example.
You have your data, and you have your data scientist, in this centralized example, with her centralized compute, and we're going to make it easy here and just have three data centers. She says: I'd like one data processing job. The moment she makes that request, she has to go out to every one of those data centers and move the data that was sitting there at those edges into the central machine. And God help you if you have bandwidth throttling at that central place, where maybe it's only 10 gigabits.
Now you have to do it serially instead of in parallel, so it gets even worse. Again, everyone faces this: it takes a long time, maybe an entire day if you're dealing with 100 terabytes. But at the end of it, all the data has been moved, it's handed back to her, and she says: okay, great, ready to start.
She runs it, it gives her her results, and she says: oh no, I got one thing wrong, or I have to rerun it, or any of a number of different questions come up. If even a remotely long time has passed (an hour, a day) since she ran her job and finally did her analysis, it's highly likely that that very expensive data cache will have been evicted, and you'd have to start all over again. So, not so good. So you might say: well, everything's better on a hyperscale cloud. Well, sort of. Make no mistake:
they are very, very scalable, and they have some degree of effective decentralization, in that they have many data centers all over the world. But the truth of the matter is, your users aren't sitting in those data centers. They're still going to be decentralized, whether you want them to be or not, and we really haven't even gotten to the cost.
That's one of the reasons, by the way, why after you upload your data to these clusters, they tear them down and delete the data: it's so expensive just to maintain them. So I would like to propose that we need a system that maps to how we collect and store data today. And what does that look like? Well, for us, it's the center of the Venn diagram.
As you're going to see in examples today, your data scientist needs to declare her job and the pipeline necessary to run it, so we're going to have to build a reproducible-ish job and pipeline environment. True reproducibility is very hard; you're going to see some great platforms today that get close, but we're certainly not done. Second, you need to start those jobs. Now that you have the job defined, you have to start it, and start it at a distance, where the data lives, which means we need decentralized and orchestrated execution.
I need to be able to spread out automatically to a number of machines and execute those jobs in each place. And finally, I need to ensure my job finishes and that I can trust the results, so we're going to need to build an incentivized, consistent, and verifiable network. And it can't require rewriting everything. We can't go back to Hadoop; let's not make the same mistakes that were made in the past. Let's meet folks where they are. So in this new world, it looks something like this: you have your data scientist again.
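To make the "declare your job and pipeline" idea concrete, here is a minimal sketch of what such a declarative job might look like; the field names are hypothetical, not from any particular platform:

```python
# Hypothetical declarative job: data is named by CID rather than by location,
# so the network (not the user) decides where the job actually runs.
job = {
    "image": "my-analysis:1.0",             # an existing Docker container, unmodified
    "inputs": ["bafy...datasetCID"],        # content-addressed input data
    "cmd": ["python", "analyze.py"],
    "constraints": {"move_data": False},    # e.g. "never move this data", per the talk
    "verification": "deterministic-rerun",  # one of several possible strategies
}
```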
She says: I'd like to do one compute processing job, and she hands it to the network. The network says: all right, I have a CID, it has three chunks, who's got them? The first node says: I have a chunk, I'll take that. The second node says the same thing: I'm good, ready to go. Then we have a problem.
The third node says: well, I have the chunk, but I don't have any CPU space; I'm already doing something else. And then another node says: you know, I have CPU space, but I don't have a chunk. So let's get them to work together: let's have the network automatically move it from point A to point B.
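A toy version of that matchmaking logic, assuming each node advertises the chunks it holds and whether it has free CPU (all names here are hypothetical):

```python
# Each node advertises (chunks held, CPU availability).
nodes = {
    "n1": {"chunks": {"c1"}, "cpu_free": True},
    "n2": {"chunks": {"c2"}, "cpu_free": True},
    "n3": {"chunks": {"c3"}, "cpu_free": False},  # has the data, but busy
    "n4": {"chunks": set(), "cpu_free": True},    # idle, but no data
}

def place(chunk):
    # Prefer a node that already holds the chunk: no data movement.
    for name, node in nodes.items():
        if chunk in node["chunks"] and node["cpu_free"]:
            return name, "run in place"
    # Otherwise pair a data holder with an idle node and move the chunk.
    holder = next(n for n, v in nodes.items() if chunk in v["chunks"])
    idle = next(n for n, v in nodes.items() if v["cpu_free"])
    return idle, f"move {chunk} from {holder}"

for chunk in ("c1", "c2", "c3"):
    print(chunk, "->", place(chunk))
```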
Now, obviously, I mentioned before that you want to try to avoid moving data where possible, but that may not always be an option, and it's up to you as the job definer to spec it. "I never want to move this job"? That's all fine; maybe we'll wait until the first node is done. Or: "you know what, I do want it in a hurry, I'm ready to pay that overall cost, in money or time or whatever it might be." So they run it.
The node gets the job, and then it's able to run it. Again, we can't get processing for free, but you are executing in parallel by default. And then they say: okay, we're done. But we're not done; the compute over data network isn't done. We need to verify it. So then another node volunteers itself and says: I want to verify that those jobs were completed correctly. It automatically goes through that process, it takes a few minutes, and then it says: the job is verified, download at your leisure. And she can do so.
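One simple verification strategy (arbitrary lambda functions and SNARKs come up later as alternatives) is deterministic re-execution: rerun the job on another node and compare content hashes of the outputs. A sketch:

```python
import hashlib

def digest(output: bytes) -> str:
    # Content-address the result: identical computations yield identical hashes.
    return hashlib.sha256(output).hexdigest()

def verify(original: bytes, rerun: bytes) -> bool:
    return digest(original) == digest(rerun)

print(verify(b"result-v1", b"result-v1"))  # True: job verified
print(verify(b"result-v1", b"result-v2"))  # False: flag for dispute
```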
So the case that I hope I'm making here, in the landscape of the overall compute over data environment, is what you see here. We're proposing that it's easier to manage, because it provides you a self-organizing network. This is something where all the nodes understand each other, know how to process, know how to maintain, restart, and move things around. It's compatible with existing tools: you can take your Docker container,
you can take your WASM, you can take whatever it might be, and move it to where the data is. And it's got built-in verification: within that, you can use a number of different verification techniques. You can provide arbitrary lambda functions; you can provide, we hope soon, SNARKs and other things like that, to provide execution and verification at the edge.
It's more cost-effective, as I already mentioned, and it's effectively serverless: those nodes, the data, and so on already take advantage of powerful networks like IPFS to maintain themselves as things come up and go down, and reproducibility automatically kicks in when nodes need to move things around. You can also get efficient bin packing: this thing has CPU over here,
and this thing has storage over here; let's figure out how to bring those together.
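Bringing those together is a classic bin-packing problem. A toy first-fit sketch of that placement decision (job CPU demands packed onto identical node capacities; the numbers are made up):

```python
# First-fit-decreasing bin packing of job CPU demands onto nodes.
def first_fit(jobs, capacity):
    nodes = []  # remaining capacity per opened node
    for job in sorted(jobs, reverse=True):
        for i, free in enumerate(nodes):
            if job <= free:
                nodes[i] -= job
                break
        else:
            nodes.append(capacity - job)  # open a new node
    return len(nodes)

print(first_fit([4, 3, 2, 2, 1], capacity=6))  # -> 2 nodes suffice
```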
And, of course, it's impossible to overlook the greatly reduced ingress and egress. You're able to take advantage of the fact that you're executing right next to the data: you have a file handle, instead of a network port, to go and get your data from.
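In other words (paths hypothetical):

```python
# Compute shipped to the data: the input is just a local file...
with open("/inputs/data.csv") as f:
    rows = f.readlines()

# ...rather than the data shipped to the compute, which is what we avoid:
# rows = requests.get("https://central-store/data.csv").text.splitlines()
```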
Finally, it's reproducible and collaborative. Obviously, everything you see here is going to be content-addressable: content-addressable jobs, content-addressable hashes, content-addressable by Merkle tree. You're going to be able to do everything you need, and know that you're not reproducing things that are already out there in the world. You also get metadata and lineage for free.
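Content addressing is what makes the lineage free: every derived dataset is named by the hash of its contents and records the addresses of its inputs, so a result can always be walked back to exactly what produced it. A toy illustration (real systems use CIDs and Merkle DAGs rather than bare SHA-256 over JSON):

```python
import hashlib, json

def address(obj) -> str:
    # Name content by the hash of its bytes.
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()[:16]

raw = {"readings": [1, 2, 3]}
raw_id = address(raw)

# The derived dataset's record carries its own lineage in "inputs":
derived = {"mean": 2.0, "inputs": [raw_id], "job": "compute-mean:1.0"}
print(raw_id, "->", address(derived))
```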
These networks will also be able to provide things on chain that carry proof of exactly what happened and when, which is something that none of today's platforms provides natively; they all have to go out somewhere else and wire in that metadata and lineage themselves.
Without that, you run into real problems as you begin to get down your pipeline. As you begin to produce derivative data set after derivative data set, you need to be able to walk all the way back: where did I collect this, what did it come from, and so on and so forth. That's on us, to build that into the platform by default. And finally, I think you can get to truly innovative models today. Maybe you have an S3 bucket, or, by the generosity of a hyperscale cloud,
they'll support an open data set, and that's nice, that's great, and we obviously don't want to turn our nose up at it. But truly, if you're doing this properly, you'll have ways for communities to join together and each contribute. They may contribute money, they may contribute their own compute, they may contribute storage, they may contribute working time on these things, and provide overall models and new ways of processing the data. In a decentralized world, that becomes much more possible, without sharing a single
API credential. Now, we're working on this right now. We have a very passionate group of people: this is the Compute Over Data working group, already 15 members representing 75 people, who meet every other week. We're already beginning to work on many of these things, whether they're standards or collaborations between these various organizations.
We all care about this right now. And you might ask: okay, you just showed me 15, which one would you like me to pick? And the truth is, there isn't one. You might have seen this earlier: Juan presented this, and we're very inspired by it, because this is the true reality. You have a three-axis system right now, where it's basically up to you to decide what you want.
You're going to have privacy on one axis, verifiability on another axis, and performance on a third axis, and it's up to you to decide which of these is the right fit for your situation. And in truth, this is again something that really differs from the systems that you have today.
Today it's really one-size-fits-all. If you went out and spun up a Spark cluster, or a Hadoop cluster, or EMR, or take your pick: it's a great platform, but it's making a bunch of decisions for you, and it's asking you to accept, okay, here's where you are, you're done. What we think will happen is that people will pick and choose, even within a single organization, and say: oh, this is HIPAA data, I need it to be FHE; or, this is actually just log data, and I'm totally
okay with it, I want this to be very performant, but I don't care that much about verifiability, things like that. We want people to be able to pick and choose these as L2s on top of a common storage and network solution. And you might say: that is a lot to think about. And it is. Would you like to learn a bunch more? I have great news for you.
This is what the compute over data track looks like over the course of the day. You just heard me talk a little bit about the overview and what's going to happen. Coming up right after this, you're going to see Hashgraph, a really cool platform.
There's a talk about warp work from Eric, and then I'll be back to talk about one of the potential platforms that are out there:
a platform called Bacalhau. You might have heard about it. It operates natively on IPFS and can be the platform for many other platforms in the future.
We'll have Zach (excuse me, Matt) from the FVM team come and talk about how this all integrates with chain consensus. Then we'll come back to showing you how to build the kinds of apps that take advantage of these underlying platforms. And at the end, we'll talk about Filecoin mining and helping to build an infrastructure network that layers over the top of all of this.