Description
Project Bacalhau is focused on helping projects increase trust via open data processing pipelines. This talk describes how Project Bacalhau is working to improve the state of Decentralized Science via public data processing use cases.
So I'd like to start the talk with a little bit of a reflection, talking about Nicholas's history of the scientific process, and specifically the challenges we want to help with through open infrastructure.
So let's start with an example of how it started. This is a visualization of a 1768 painting of an inventor who created a new instrument of science known as the air pump, which we refer to today as the vacuum, and you can see the audience surrounding him as excitement floods the room, waiting to observe the fate of the bird in the glass vial. This is a fun example, because it's open science and it's happening in real time. It's verifiable! It's trustless!
You can see that this is actually playing out in real time, and maybe that's some version of what we're trying to get back to in the new scientific movement.
So, a little bit of reflection: I follow a number of folks on what we might call DeSci Twitter, in particular Jocelyn Pearl, and she had some interesting science memes that I thought would be fun to share. This is a bit of a reflection on the challenge academics are in today. To give you a little background, my own background is more in cloud infrastructure, so I've been learning about academia, and it's been really fascinating to see the systemic challenges academics have to go through. These are people who really want to make the world a better place, and through no fault of their own, and no fault of their institutions,
the structures have solidified in a way that makes it very painful for them to do the work they do. In fact, there was one academic, studying, I want to say, advanced physics and computational work, who was just at the end of her rope: she's fighting with the grant process, she's trying to get published, and she's effectively prohibited from doing the work she really cares about. So what we're seeing is a movement of brilliant scientists who are willing to take a different direction, and I think LabDAO, VitaDAO, and Molecule DAO represent the best aspirations of the people who are going to start new industries in this space. In particular, we want to empower these folks with the best tools, so that if you're a researcher who's been in an institution, where you have your Python notebook and you've been doing your work in that space, we can give you tools that are as good or better in this new space. So there are a couple of things we need to fix.
First off, the public cloud platforms are very robust, but they're often oriented more toward closed systems. And even when you get into the economics of web3 systems and the things that Nicholas was describing, tokenization and those sorts of payments become an issue. Web3 projects that are innovating in the decentralized science space also deserve first-class decentralized infrastructure that is on par with what they're trying to do. Now, there have been open storage platforms like IPFS and Filecoin for some time, but we're just now seeing the burgeoning of compute platforms, and Bacalhau is just one of many projects trying to bring that compute capability to the ecosystem. So let's visualize this, particularly for folks who may not have a technical background.
You have your storage infrastructure here on the bottom: you can store whatever you want, as much as you want, at super low cost or for free. Above that you have some compute infrastructure, which may be Bacalhau or another project in our space. And then you'll have apps; I think LabDAO represents a good example of this.
You have IP-NFT protections for scientists, who can fund their research and align their financial incentives, and ultimately, hopefully, over time we can rebuild an industry where scientists can be self-supporting and do great work on their own. All right, so to dive a little deeper into the technical piece, I want to give you an example of a problem that came up through the Max Planck Institute, from a project that was launched about two years ago.
It's named Eureka, and it's aimed at the challenges of measuring Earth's temperature. It turns out that climate scientists struggle with accurately identifying the temperature of the ocean when clouds appear, because clouds can make it difficult to accurately measure the humidity and temperature of the ocean in certain places. So they launched this large-scale survey.
At the end of this, you have terabytes and petabytes of data spread across different universities, different GDPR jurisdictions, and different academic ownership of the data, all trying to solve this problem, which is truly good for humanity; the problem gets magnified significantly. As an example, if you go to the case studies at bacalhau.org, the team has, through the Eureka project, actually posted those data sets on IPFS, and you can get access to most of that raw data.
Today it's been cataloged; it lives in different places, and they're starting to stitch this data together. Rather than having it siloed in an individual university or an individual private repository, it's now publicly available. But one of the challenges we want to solve is that when you get to very large data sets like that, terabytes and petabytes, it can be difficult to move that data quickly across long distances. It's so large, it just takes time.
The network pipes have limitations, and so one of the things we're very interested in is sending the compute to where the data lives. If you host a large portion of the scientific data, many petabytes of information, we want the researchers who are going to do pre-processing of the cloud images, the cloud masking, to send their compute to where that data lives. It's much more efficient, and you again get the best-in-class experience you would get from a public cloud, but through these web3 technologies.
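To make that concrete, here is a minimal sketch of what sending compute to the data looks like with the Bacalhau CLI. The CID, container image, and script path are hypothetical placeholders, and the `-v CID:path` mount syntax follows the project's early published examples rather than anything shown in this talk:

```
# Mount a content-addressed dataset into the container by CID (placeholder
# CID and image). A compute node that already has this CID pinned locally
# can bid on the job and run it without petabytes ever crossing the network.
bacalhau docker run \
  -v QmYourDatasetCID:/inputs \
  yourname/cloud-mask:v1 \
  -- python /app/mask_clouds.py /inputs /outputs
```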
And so let's bring this back a little bit to the impact we would like to have on the way researchers work today. Imagine a researcher says: I just did some pre-processing of this Eureka data set, and through my sophisticated machine learning I was able to refine the images of clouds, and now we can more accurately measure the Earth's ocean surface temperature. Researcher two says: great, can you send me the files? I'd love to reproduce your work.
I fork my code on GitHub from other people all the time and build off their work; I would love to do the same with yours. This is what I expect; this is how technology works today. And all these other researchers say the same thing, so now you've got an audience, a community, and a bit of a scaling concern.
So this traditional approach of having the data live on an FTP server at an academic institution is going to hit up against scale limitations very quickly, and our solution is salted codfish. The name of the project is a bit of a play on words: bacalhau is the Portuguese term for cod, and COD stands for compute over data; that's how we got our name. And so the goal, and this is stealing a quote about the Bacalhau project from a famous builder in the DeSci space,
A
Is
that
now,
when
the
data
and
the
processing
are
completely
in
public,
the
resources
are
automatically
shared
automatically.
It's
it's
default
to
open,
which
I
think
a
lot
of
you
are
hearing
in
the
space.
Not
only
is
it
natural
for
the
web
3
Community,
but
it's
really
natural
for
what
academics
want,
even
if
they
are
limited
in
some
way
by
their
institutions,
and
so
now
you
have
an
annotated
graph.
You
share
it,
you
build
on
it
and
in
many
ways
the
scientific
Community
moves
faster.
So
going
back
to
the
technical
schematic.
Now we have all this data that lives in IPFS. We send a copy of that information out, and it lives on these different pinning services; there's a similar architecture for Filecoin, which we'll get to in a bit. The data now lives in all these places, with everyone contributing to storing a copy of it. And so, as a researcher, I bring my data and I bring my code; maybe it lives in GitHub and it's a Docker container. We send it to the Bacalhau cluster, it gets processed, and the resulting data also lives in IPFS.
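One nice property of this schematic is that because everything is addressed by CID, it doesn't matter which pinning service or peer actually serves the bytes. A rough sketch with the standard ipfs CLI, using a placeholder CID:

```
ipfs pin add QmEurekaDatasetCID   # volunteer your own node as one more copy
ipfs ls QmEurekaDatasetCID        # list the files inside the dataset
ipfs get QmEurekaDatasetCID       # fetch it from whichever peers hold it
```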
So let me give you a little more of a technical deep dive, for those of you who are more hands-on with IPFS and Docker technology. The Bacalhau platform is meant to treat Docker containers and WASM binaries as first-class citizens. To translate that into slightly less technical language: any work you've done as a scientist, whether you've written Python libraries or something in Julia, can all be wrapped into a container when you submit it to the Bacalhau network, as in the sketch below.
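As a sketch of what "wrapping your work into a container" means in practice (the script name, dependencies, and image tag below are placeholders, not something from the talk):

```
# Package an existing Python analysis script as a Docker image that a
# Bacalhau compute node can pull and run.
cat > Dockerfile <<'EOF'
FROM python:3.10-slim
RUN pip install numpy xarray             # whatever your script imports
COPY mask_clouds.py /app/mask_clouds.py
ENTRYPOINT ["python", "/app/mask_clouds.py"]
EOF

docker build -t yourname/cloud-mask:v1 .
docker push yourname/cloud-mask:v1       # image must be publicly pullable
```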
One of the compute nodes will bid on your job, maybe because it has the data locally or because it has the availability to process your job; that all happens transparently to the user. The data is moved between IPFS and Filecoin transparently, and then the user gets their results back just as if they were running locally. In fact, recreating that local development experience is a big focus for us with the architecture, and so this is an example of what it looks like.
Obviously a simple CLI is our first goal there; eventually we're going to build out more web-based user experience capabilities for the platform. So, just briefly, I'd love to show you a video of what this looks like in action. On the left-hand side you've got a command line where you're going to submit some Bacalhau jobs; on the right you've got a bunch of files that live in IPFS. These are Landsat images; they have clouds, and they need to be processed. We're going to run a job here.
In fact, you can see this is the bacalhau docker run command; we're going to run a simple image-resizing job against those files, and when we submit it, it goes off over the internet to a Bacalhau cluster that lives somewhere in the world. When we say bacalhau list, we can see that the job is actively running; then it gets completed, and we have a nice IPFS CID, which everyone in the world can access afterwards. Now I say bacalhau get for my job, and it brings me back all the results.
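Pieced together, the session in the video looks roughly like this. The dataset CID and job ID are placeholders, and the imagemagick invocation follows the project's published image-resize example as best I can reconstruct it, so treat the exact flags as illustrative:

```
# Submit: resize every Landsat image in the IPFS-hosted dataset.
bacalhau docker run \
  -v QmLandsatImagesCID:/input_images \
  dpokidov/imagemagick \
  -- magick mogrify -resize 100x100 -path /outputs '/input_images/*.jpg'

# Watch the job move from Running to Completed.
bacalhau list

# Pull back stdout, any errors, and the resized images; the resulting
# CID is fetchable by anyone over IPFS.
bacalhau get <job-id>
```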
I get some standard output; if there were any errors, I get that information; and very quickly I get my new file, which has been automatically resized. That's now available on the internet for everybody else to view, and it's all entirely transparent. If you have interest, or if you have an opportunity to make use of public compute for data processing, please reach out to us. You've got my contact information here, and we also have a channel on the Filecoin Slack. Jump in, ask questions, give us feedback.