From YouTube: Overview of Project Bacalhau - David Aronchick
Description
Over three days in April 2022, we brought together 50+ people from across the ecosystem (Starfleet, OuterCore and PLN) to discuss opportunities and architecture of Compute over Data.
- Core problems to be solved across analytics platforms
- How to meet data engineers/scientists with the tools they use today
- Vision (familiar, simplified, collaborative) and high level roadmap for Bacalhau
Learn more: https://www.protocol.ai
Subscribe to Protocol Labs: https://youtube.com/@protocollabs?sub_confirmation=1
Follow Protocol Labs on Twitter: https://www.twitter.com/protocollabs
Compute over Data Summit returns in 2023!
Dates: May 9-10
Location: Boston, MA
Registration: https://www.codsummit.io
Thank you so much. So, this is Bacalhau. I know this is the temporary logo. I love it too, but yeah, it looks like a dead fish. It's not good. You can't have a new company without some incredibly pithy, you know, data-transform imagery, so I hope you used that as a placeholder. I actually also like it as an official way to process data. I want to add a caveat for this session.
My session is going to be an extremely high-level walkthrough, drawing from my own experience. A lot of you don't know me from Adam: I previously led Kubernetes for several years (I was the first non-founding PM for Kubernetes) and started the Kubeflow project. I've been in ML and data science for a long time, so this is very much about pains that I have seen in the world. However, it is going to be super high level.
So the first thing I want to do is really set the stage for the world that we're going into. It's an incredibly powerful world, and it has gotten so much attention. In the web3 space we obviously know a lot about this, but it's really a general thing that people know about. As Juan mentioned, we're adding a petabyte of storage deals a day; this is new stuff coming onto the network.

That is unheard of out there in the world, for a platform to be adding this much data constantly. And when you look at the analysts, they talk about how big data and data usage will affect every industry. I know a lot of this slide is washed out again.
A
You
can
go
watch
it
all
online,
but
down
the
left-hand
side
is
basically
every
domain
you
could
possibly
think
about,
and
and
within
the
next
few
years
they
will
all
be
affected
by
the
accumulation
and
use
of
big
data
to
measure
this.
A
lot
of
times,
okay,
well,
are
you
using
it?
Does
it
actually
matter?
This
is
spend
per
year
and
just
a
reminder
like
juan
put
up
something
where
it's
like
total
cloud
spend
is
about.
You
know,
369
billion
is
what
he
proposed.
A
You
can
say,
like
whatever
fermi
estimation,
that
at
between
300
and
400
billion
roughly
in
five
years,
one
third
of
that
ish
will
be
on
data
alone,
like
that's
an
enormous
number
and
an
enormous
piece
of
this
pie,
and
obviously
we
are
participating
in
almost
none
of
that
right.
So
there's
an
opportunity
there,
so
there
you
go
so
and
to
show
you
that
it
is
really
early
innings.
You can kind of follow where the big data investments are happening, and here you have about 67 billion dollars in 2020 moving into big data platforms. This is all public markets or private markets and things like that. So this isn't spend; this is betting on companies to solve this problem.

I promise this is the end of the stage-setting, but it's kind of impossibly big to talk about some of these numbers. People estimate that about three trillion dollars is wasted yearly on bad data, bad data processing, and things like that. And users generate, and I promise this wasn't coordinated, but it's actually the number Juan had circled in that pie chart, down in the lower left-hand side, about 2.5 exabytes of data every day.
That was the circle that it adds up to, and so on and so forth. And just to give you some inspiration that we're potentially saving the world here: Google is literally using big data to help solve fusion. So, not a terrible thing to be spending our time on. Juan talked about debuggability, monitoring, and so on and so forth. This is the level of success for organizations today, and again, the sample sizes are small and the numbers vary, but they almost all say the same thing:

70% of projects are successful, or less, meaning you can basically flip a coin on whether your project is going to fail. It's that bad, so there's lots of stuff we can do to improve this. Okay, so that's kind of the market. I hope I've inspired you, suggested that this is big, and that we can go after it and make a big difference.
So one thing I want to do to set the stage: first off, there are many, many super smart data developers in this room right now, and you should go talk to them. They work in academia.

They work at big companies, and things like that, and they'll have a really good sense for this. For those that are not big data developers, or have not done this previously, I'm just going to walk through the pain of what they experience today, and I hope to impart to you who our target market is, at least at the start. So, just to give you a sense of this:
Our target is about four million data developers today who are using big data in some way: developing big data pipelines, transforming large data sets, or things like that. And they're growing extremely quickly, by various measures about 10x from 2016 to where they are today. We can plug into them, and, for better or worse, again referencing Juan's talk:

They are almost entirely ignored by a lot of developer tooling today. For example, the standard developer tool for doing breakpoints and things like that is gdb. There is no gdb for a distributed system, for data processing, for any sort of pipeline, and that's a nightmare. How do you set a breakpoint in your data pipeline to know whether or not you are transforming the thing wrong? It's really hard today. Really hard.
That is a goal, among many others. If we did nothing else but take all the standard development tools that people have today on their local machines and enable them to work in a distributed way, we would have already won; we would have done so much better than whatever 67 billion dollars of investment have done. That's the modest version, and we can go much, much further than that. Oh, sorry, this slide is a little bit washed out, but where do they spend their time? On the left-hand side:

Here you have the standard flow for a data developer today. It's data loading, data cleaning (I'm just reading them out because it's washed out), data visualization, model selection, model training and scoring, and model deployment. This is broadly from the ML space, but the concept of developing a model based on data is not specific to ML. That's really standard for what people do: basically, output a set of things, create an artifact, and use that artifact in your code, or whatever it might be.
So, for those that don't know what a typical data pipeline looks like: you start with ingestion and processing. You move to engineering and splitting the data into some form of "this is the live stuff I want to train on or analyze." Then you have a holdback set that never touches your training or other code, but that you use to test what you trained. You always want to keep those separate, because if you allow your test set to bleed into training, then you can overfit and have other issues like that.

Then you finally train, or create your artifact based on the result, then you serve it in your overall application, and then, ideally, you loop the results back to the original data.
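The holdback discipline just described can be sketched in a few lines of Python (my own minimal illustration; the function name and the 80/20 split are assumptions, not anything from the talk):

```python
import random

def split_dataset(rows, holdout_frac=0.2, seed=42):
    """Shuffle deterministically, then carve off a holdout set that the
    training code never touches; it is only used for final evaluation."""
    rng = random.Random(seed)
    shuffled = rows[:]              # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_frac)
    holdout, train = shuffled[:n_holdout], shuffled[n_holdout:]
    return train, holdout

train, holdout = split_dataset(list(range(100)))
# The whole point: no leakage, the test set never bleeds into training.
assert not set(train) & set(holdout)
```

The fixed seed is what makes the split reproducible, which matters later when anyone else wants to verify the same artifact.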
If you show this to just about any developer today, they're like: of course, this is exactly what I do. And it doesn't need to be on a distributed platform; this is also what they do locally. This is what they do all over the world.

This is our focus for now. Again, it's not to say we don't want to do federated learning, or distributed training, or checkpointing, or any crazy things like that. But if we just solve this, we will make such a huge difference to the world. What would they like? To lay it out for you, I've summarized it in three categories (a shameless plug, I guess). It really comes down to one of three things. First: familiar.
They want to understand it already, ideally. Second: simplified, even from where they are today. And third: collaborative. I'll get into what each of these means in a second. First, a little bit about data pipelines. I mentioned that pipeline system earlier, and people often focus very narrowly on just building a model or an artifact for their end solution, but in truth it looks like this: many, many components wired together. Ingestion, transformation, engineering, validation, training, then doing all those steps again at scale.

Then rolling it out, and then ultimately monitoring and observing it in production. That is what really moving things to production means, and again, this is not new. This is kind of what software development looks like today. Each of these steps is independent, and each of these steps is independently composable.
Now, the challenge here is that every person doing this development will experience it in a slightly different way. This is the classic Microsoft Office thing, where people ask: why does Microsoft Office have so many features? I don't use 95% of it. It turns out everyone uses a different five percent, which is super annoying if you're a product developer, but it is the reality of the world. What does this look like inside a big organization? Well, here you go. Microsoft actually published a paper about this about three years ago.

They did their own survey internally, at one of the most sophisticated machine learning and data development organizations in the world, and they have 159 different tools. 159! Can you imagine being an SRE there? You might be saying: oh, I have to support whatever ten-year-old CNTK, what the hell? But 11 people need it. So what are you going to do, tell them to go f themselves?
So this is again super standard, but it really highlights a need for us to understand that people are going to use the tools they're going to use, and we need to encapsulate those tools so they can keep using them in the way they're familiar with, but still give them an opportunity to participate in this very public data platform. Make sense?

Okay, so that's what familiar is. Now, to up the level of difficulty: that was just the tools; we haven't even gotten to platforms. By platforms I mean: you have compute providers, and you've got a lot of those, and then you have data platforms, and you've got a lot of those, and those are useful as well. Except I think there are too many choices. We should get rid of all of them. This is what users actually want:
So here's sed. This is the Wikipedia page for sed; sed was invented in 1974, so that's pretty good: 48 years old. I think we should bring data science back to the '70s. We should make it as easy to use this 48-year-old technology on your brand-new technology as it is today, and I cannot tell you how many people use sed. It is so commonly used out there, just to process a CSV file or something like that.
It is a wonderful tool. Let's not reinvent it; let's not try and throw it out. Instead, let's try and meet folks where they are. Okay, so that's, for me, simplified. Now let me walk you through what the data scientist's workflow actually looks like. Here you have a very, very standard example; this is like the canonical tutorial in machine learning and data science, the one where you create the housing-price data frame. Go to any tutorial and you're going to see it.
One of these "predict my house price for me" things, in a Jupyter notebook. There you can see it's about three lines of code; half of that is literally loading the thing in, and you're done. Pretty simple, and you can get going. Now try to do that exact same thing over there in Hadoop, also a whatever-15-year-old technology.
It looks like this, and this is still missing about half of it; that's how bad it is. So you're asking a data scientist who had something working pretty well locally to now translate all that mess into this, for the exact same functionality. Not so good, because it's just super painful. Now, to be clear, it's not just data developers that are facing this pain. SREs face this too.
So here you go, I'm going to take you through a play in one act. The data scientist has her local machine and it's running perfectly; she's found her data set, her model works locally, and it converges. Presto, I'm ready to go! So the first thing she does is go to our ITOps person to provision an entire cluster. Again, this is something most folks actually face.

You can't just get unlimited compute; you don't have the Protocol Labs credit card, so you need to go to your central IT staff and get it. The first thing she has to do is provision it, and that by itself takes forever. The ITOps person is going around with a hundred things to do; maybe you filed a ticket on it, and I'll get to it this afternoon, later this week, whatever. Finally it's provisioned, and now the ITOps person says: okay, well, great, I'm glad you provisioned it. Now:
A
Can
you
do
this
right,
like
here's
half
a
dozen
things
or
more,
that
she
has
to
do
just
to
take
that
code?
That
runs
locally
perfectly
well
to
production,
and
many
of
these
things
are
because
she
works
in
an
itops
organization
that
requires
you,
know
acls
and
various
things
like
that.
You
have
to
rewrite
it
into
java.
That's
a
super
common
request
which
no
data
scientists
want
to
do.
I
promise
you
use
out-of-date
libraries
that
have
passed.
A
You
know
global
security
requirements
because
we're
not
going
to
allow
anything
to
deploy
that
touches
production
data
without
this
various
things
like
this,
it's
just
a
lot
of
stuff
that
they
asked
and
she's
just
like.
Well,
I
just
want
to
run
that
simple
job.
Why
can't?
I
just
do
that,
and
the
reason
is
is
because
that's
the
requirements,
so
she
does
that
that
sucked
and
she
says
fine,
great
off
you
go
and
she
provisions
it.
It
runs
and
success.
A
It
actually
did
except
they
forgot
to
turn
it
off,
which
happens
all
the
time
as
well
and
presto
now,
you've
just
blown
through
your
entire
monthly
budget,
because
you
forgot
that
these
were
gpu
machines
that
cost
whatever
two
thousand
dollars
an
hour.
Super
super
common
situation.
You
see
this
all
the
time
and
you
see
like
super
pernicious
behaviors
around
this,
where
it's
like.
Oh
I'm,
gonna
secretly
like
spin
up
and
use
someone
else's
cluster,
I'm
gonna
plant
on
gpus,
because
we
have
a
limited
number
of
gpus.
So
I'm
not
gonna.
We can do better. And finally, I want to inspire you around collaborative. Literally the reason I joined Protocol Labs six months ago was that I want to try and help our children and our children's children avoid living in a barren hellscape, and this is part of that; it's really positive.

The problem is that collaboration around science today is really hard, and around data it's even harder. Today you have these open data sets all over the place, literally petabytes of very valuable data out there in the world, and they are awesome. Here you can see the Cancer Genome Atlas. This is just hosted on Amazon. Actually, technically, it's not hosted on Amazon.
It's on an FTP site, and you can click a button that provisions an S3 bucket and copies it there, which means you now start paying Amazon for it, which is messed up as well. But suffice to say, there's at least a catalog of these things out there, so far so good. This is Landsat; Alex is going to talk a little bit about Landsat later. Landsat is already hosted on IPFS, which is awesome. But today, let's say you have three scientists that come together, and they're like, well:

I would like to use Landsat. It's a super popular satellite data set, donated by governments all over the world. Number one, the first data scientist says: I want to create a tiled version, so I'm going to take a subset of the original version, tiled so that it focuses on the different areas that are interesting to me. In this case she's a volcanologist, all right?
A
Whatever
anyhow,
she
wants
to
study
volcanoes,
so
she's,
like
I'm,
just
gonna,
grab
a
picture
of
that
volcano
second
person
wants
to
do
scale
so
sam
same
thing
before
just
reduced
pixel
density.
This
is
a
super
common
requirement
because
of
you
know
not
needing
that
kind
of
fidelity
and
images
being
very,
very
large,
and
then
the
third
data
scientist
says.
A
Oh,
I
want
to
do
the
same
thing,
but
I
want
to
actually
grayscale
it
again
very
common
when
you're
building
your
artifacts
is
to
use
lower
resolution
versions,
because
you
don't
need
that
higher
resolution.
You
can
achieve
the
same
thing
at
you
know,
one
tenth
or
one
hundredth
of
the
cost
by
working
on
these
smaller
sets
so
far
so
good.
So
each
data
scientist
has
gone
off
and
done
her
own
thing
and
we
have
a
fourth
data
scientists
come
from
says.
Oh,
I
actually
want
all
three
of
those
right.
I want it scaled to the interesting elements, I want it tiled, because I don't need all that various land and water, and I want it grayscaled. But she can't; she can't touch any of those. Those all went off into private research. They didn't republish their methodology for doing these particular things, and again, oftentimes papers will describe it, but it's like we were talking about last night.

I was talking with Alfonso, and he's like: I hate reading papers, because the first thing I want to do is at least try and attempt to figure out how the hell they did this thing, and it's hard, because oftentimes they don't publish it correctly; their code doesn't work anywhere except on their machine, and so on and so forth. So, not so good. However, with Bacalhau, we can change this. Same exact situation, except in every case they republish the CID, and now it's out there, and I can see what happened.
I can see lineage. I can see, from the original data set, how it came down and what they did, and now the fourth person comes along and says: oh great, I'm just going to grab all those and use them as my data set. There's a variety of ways we can go about achieving that. But not just that: it then becomes collaborative, and they can get leverage on it. And so the next person comes along and says: oh, I just want to know what they did.
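The lineage idea can be sketched with an ordinary content hash standing in for a real CID (purely illustrative: real CIDs are multihash-based, and this is not Bacalhau's actual API or data model; all names and payloads here are invented):

```python
import hashlib

def cid(data: bytes) -> str:
    """Stand-in for a real IPFS CID: any content hash gives the same
    property, namely identical content yields an identical address."""
    return hashlib.sha256(data).hexdigest()[:16]

original = b"landsat scene: raw pixels ..."       # hypothetical payload

# Each scientist publishes a derived data set plus the recipe that made it.
lineage = {}
def publish(parent_cid, transform_name, derived_bytes):
    child = cid(derived_bytes)
    lineage[child] = (parent_cid, transform_name)  # recoverable provenance
    return child

root      = cid(original)
tiled     = publish(root, "tile:volcanoes", original + b"|tiled")
scaled    = publish(root, "downscale:10%",  original + b"|scaled")
grayscale = publish(root, "grayscale",      original + b"|gray")

# The fourth scientist can walk every result back to the original:
assert all(lineage[c][0] == root for c in (tiled, scaled, grayscale))
```

Because the address is derived from the content, republishing the CID plus the transform is enough for anyone else to verify, reuse, or recombine the derived sets.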
I'm just going to save the time. Make sense? So: unprecedented collaboration, because of the way that we're operating here. So that's the scope. I hope I'm getting you to the inspiration, and that you're excited. Familiar, simplified, collaborative. Again, these are just my words; I'd love for us, as a community, to come together and figure out what our core tenets are and move forward from there. So: can we improve big data with small changes for data developers? That's what the Compute over Data and Filecoin, or the Bacalhau, project vision is.
A
So
this
is
my
words
again
take
it
for
what
you
will.
I
think
there's
lots
of
crafting
here.
That's
too
buzzwordy,
but
you
get
the
idea.
I
think
we
can
transform
big
data
by
giving
developers
simple
first-class
distributed
tools
and
unlocking
a
collaborative
ecosystem.
This
is,
I
think,
our
mission
again
lots
of
honing.
I
think
we
should
probably
have
an
unconference
just
talking
about
how
we
talk
about
this
thing,
but
setting
that
aside,
this
is
what
I
would
like
to
do.
It
looks
like
you
know.
All
the
things
I
mentioned
are
already.
A
We
simplify
it,
give
meeting
people
where
they
are
using
these
tools
that
they
already
know
and
love.
We
deliver
performance
improvements
because
we
can
and
we'll
I'll
talk
about
that
in
great
depth
in
a
moment
and
then
folks
later
will
and
launch
this
new
collaborative
science
community.
What does this look like? You take a 10-gigabyte CSV file, you upload it to IPFS, and from that you get a CID. You then execute using the command line; we have a downloadable, and you can go to bacalhau.org right now and install the binary yourself. You submit your job with your CID: you name the CID, and then, in the command,
you name the command. This one right here: I used sed, as I mentioned earlier, to process the large CSV and filter it down to just the rows within whatever, 50 kilometers of Portugal. Pretty simple stuff, stuff that data scientists do every single day. And then I fetch the results. Presto: I have added one new tool, but most of this is totally understandable to a data scientist, no matter where they are. I didn't have to use Hadoop or HDFS. I didn't have to rewrite this in Java.
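As a rough stand-in for that sed job, here is what the same filter looks like in plain Python (my own sketch: the column names are invented, and Lisbon's coordinates stand in for "Portugal"):

```python
import csv, io, math

LISBON = (38.7223, -9.1393)  # reference point standing in for "Portugal"

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(a))

# A tiny stand-in for the 10 GB CSV named by its CID in the real job.
raw = io.StringIO(
    "name,lat,lon\n"
    "lisbon-sensor,38.72,-9.14\n"
    "madrid-sensor,40.42,-3.70\n"
    "cascais-sensor,38.70,-9.42\n"
)

near = [row for row in csv.DictReader(raw)
        if haversine_km(*LISBON, float(row["lat"]), float(row["lon"])) <= 50]
print([row["name"] for row in near])  # → ['lisbon-sensor', 'cascais-sensor']
```

The point of the talk's design is that exactly this kind of everyday filter runs unchanged; the only new step is naming the CID instead of a local path.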
I didn't have to figure out any kind of concurrency or job resolution or orchestration. Presto, it just works. In addition to that: no temporary storage, mostly idle compute being used, and the results were automatically added back to the chain. Privacy and things like that we're going to have to tackle; right now we're just focused on public data and performance. As Juan mentioned: familiar commands, failures automatically resolved, retries, concurrency, and ideally quite cheap. And I haven't even gotten to the biggest thing, which is no egress.

You didn't have to move this 10-gigabyte file. That's it; it was already there, so it was like running locally, and it obviously gets much, much worse as the data size gets bigger. Egress, I think, is Amazon's most profitable thing. I'm not going to say bad things about you; I'll stop there. So, we go back to our play in one act.
The data scientist comes along and says: here's a data set, that's perfect, how do I engineer it? Presto: she now submits. She adds her CID in there, she writes her sed command. It ran great locally, so she knows it's going to run great. We're already checking bash syntax, which is convenient, because I cannot tell you the number of times that I personally have made syntactic mistakes.

She runs it, off it goes, and after a while it's all done, and now she knows how many cat videos are uploaded to YouTube every second. And our ITOps person has her own stuff to do too, which is very important. So you might say: wait a second, what about these?
These are all good things: homomorphic encryption, selective execution, GPUs, enclaves, so on and so forth. Is the vision there? Nope, not yet. Okay? We're going to get there. We have the vision, we want to achieve all these things and enable great domain-specific things, exactly like Juan was saying. We need to enable businesses, organizations, and projects to go and do great things on our platform. But not yet. Our goal is exactly like Juan said: let's achieve performance first. Let's make sure jobs run.

Let's make sure they run well and efficiently, that they resolve correctly, that they recover from errors: all these things that are kind of the blocking and tackling for a system to even be valuable. Here is our roadmap; again, that tilde is doing a lot of work. Approximately in May, we would like to launch for public consumption: no incentives, 100 nodes, data smaller than 32 gigabytes, fitting into a single sector, ideally on a single machine.
One CID only, public data only, deterministic only, CPU only, no incentive structure, no verification of results. So again, this is not for general use, but ideally anyone in the world will be able to consume it, use it, and engage with it. By October (again, that tilde is doing a lot of work here):

Approximately a thousand-plus nodes, and we're not going to stop there; we'd love to get to ten thousand, a hundred thousand, a million, and so on and so forth. Running ten thousand jobs, one petabyte of processing across many files, a ninety-nine percent job success rate, 49% malicious nodes supported, and DAG execution, as in directed acyclic graphs:
A
Allow
multiple
steps
to
connect
together
a
primitive
reputation
system
likely
only
at
a
reporting
level,
not
injecting
like
choice
on
whether
or
not
I
want
to
deploy
to
a
reputable
node
or
a
provider
and
swappable
systems
swappable
their
verifications
of
execution.
So
I'll
reputation
make
sense,
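The DAG-execution idea, where each step can start once the steps it depends on have produced their outputs, can be sketched with the standard library's topological sorter (an illustration of the scheduling concept, not Bacalhau's implementation; the step names are invented):

```python
from graphlib import TopologicalSorter

# A hypothetical pipeline DAG: each job maps to the set of jobs whose
# outputs it needs before it can run.
pipeline = {
    "tile":      {"ingest"},
    "downscale": {"ingest"},
    "grayscale": {"downscale"},
    "train":     {"tile", "grayscale"},
}

order = list(TopologicalSorter(pipeline).static_order())
print(order)

# Every step appears after all of its dependencies, so a scheduler can
# launch each one as soon as its inputs' CIDs exist.
assert order.index("ingest") < order.index("tile") < order.index("train")
```

Because outputs are content-addressed, the edges of the graph are just CIDs, which is what makes multi-step jobs composable across nodes.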
So you might ask about incentives: why would I choose to run this? Seriously, not yet, I promise; we're going to get there, and by incentives I mean tokens, verification, all these things that are required. Staking, in particular, is dependent on this. How we get there: TBD. But unless we have a well-functioning system, there's no point in going forward and figuring the other things out. So let's get to a high-functioning system first. It is not that we are ignoring this; it's just a little bit later, promise. I cannot stress this enough:
We expect there will be many incentive structures, most of which will not be built by this project, and I cannot stress that enough either. Trusted execution environments, GPUs, super fast resolution times for subnets, scheduling, and all those kinds of things: wonderful. We will support all of those, and ideally we'll support them via interfaces and loosely coupled systems. By that I mean this: extensibility. Luke and Kai will momentarily talk about the overall architecture, interfaces, and pluggability, and they will go into this diagram as well.

These are core elements that we expect to have many implementations, most of which will not be written by us. It is our job to build clean interfaces and explain to people how to extend the system, so that they can build their own incentive structures and other things like that, and we provide core primitives that work out of the box, but that ideally you can swap out. So, critical from day zero:
Our system must run on these interfaces; there's no cheap-and-cheerful way of not having interfaces at the start. Even at launch we expect to have those, with various optionality and, like I said, domain-specific customization over time. Sounds great; when? Well, I already told you, nothing new here. Again, the tilde is approximate, very approximate, software engineering. What's the rule of thumb? Double the time and add two weeks, something like that. But it's not about the date, or the idea.

Now, the number one way to identify that someone is an absolute blowhard is that they put up a slide with a quote from Steve Jobs, right? So this one is not Steve Jobs, it's me; I said it. But this is actually critically important, and it leverages exactly what Juan said: the disease is thinking that the idea matters. I cannot stress this enough: the idea does not matter at all. This is about execution.
A
Ux
is
the
killer
feature
ux
at
every
phase.
Is
the
killer
feature
for
the
data
developer
for
the
sre
for
the
storage
provider
for
the
eventual
compute
provider,
the
browser
everything
ux
is
the
killer
feature.
We
cannot
move
forward
unless
this
works
liquid
smooth,
but
you
say
I
want
it
now
how
we
move
faster.
A
It's
all
of
you.
We
have
some
key
skills
and
hires
that
we
are
missing
right
now.
It's
like
three
of
us
doing
the
coding,
so
that's
so
good
we're
hiring
very
fast.
Obviously
we
would
also
love
many.
We
have
lots
of
partners
in
the
room
right
now
or
on
the
stream.
We
would
love
to
understand
where
you
would
like
to
go
and
see
what
we
can
do
in
our
core
project
to
support
you.
So: what interfaces, core primitives, and things can we take off your plate and work on collaboratively, and then figure out where to go from there? That is stuff like storage, or plugging in schedulers, or things like that. The time to suggest things to us is now, and even if it's just coming by and looking at the already published interfaces and documentation and so on, that's enough. But if you can, collaborate with us.

I know that we've talked to folks about how we execute with WASM, how we do this, how we do that; we would love to talk more. And with that, that is my overview. We're on time, which I'm very pleased about.