Description
Svetly Metodiev and Aligned.co - custom FPGA hardware for high-performance compute workloads: Filecoin sealing, ZK proofs, and more.
Alexander Borzunov and Petals.ml - run 100B+ large language models (LLMs) like BLOOM at home, BitTorrent-style.
A
Okay, and we are good. All right everyone, hello, and welcome to everyone listening remotely to the Compute over Data working group. We are fortunate to be joined by Svetly from Aligned, who is going to tell us about some really interesting work they do building custom hardware for various compute networks. We'd like to cover not only decentralized compute networks themselves, but also the tool manufacturers and technology builders that go into them. Aligned is the first physical hardware manufacturer we've featured, and if anyone's building decentralized networks, these are the guys to get in touch with to make sure you think about optimization. Secondly, we're going to hand it over to Alexander, who's going to tell us more about the work they're doing with Petals.ml, which is a decentralized ML training network. We're going to find out all about the things going on there, and whether or not OpenAI is actually open. Just kidding. Okay, so anyways, Svetly, thank you so much for getting the content ready for us. I'll hand it over to you, if you're ready.
B
Thanks, Wes, appreciate that. So yeah, I'm coming from Aligned. What we do is sealing as a service, and we're here to talk about decentralized compute in general, with sealing being a specific use case that sits on the back end, in the trenches of Filecoin. So it's a great thing to discuss here, to let you guys know what we're up to, and then what we can do as far as compute as a service in general.
B
So, just kicking it off, we'll talk a bit about Aligned first.
B
Aligned is a high-performance compute company offering solutions as a service. Our team comes from a background of video compression and rendering, where they specialized in building their own hardware for big data processing jobs. We've now taken that skill set and applied it to decentralized networks, to blockchains, where you have big compute needs such as ZK proving and, of course, Filecoin sealing. So, as I mentioned, we specialize in building our own hardware; we utilize FPGAs, that is, field-programmable gate arrays.
B
We build our own boards with super fast connectors, and we write our own software on top of those boards to take advantage of and optimize for certain compute jobs. So what we've done is build one of these systems for Filecoin sealing. We did this back in May of last year as kind of a proof of concept, and we presented it at FIL Austin, and what we found was:
B
Our system was a lot cheaper, an order of magnitude cheaper than what we were seeing out in the market, so we knew we were onto something good there. We started developing a remote sealing solution as a service, commercialized it back in November, and now we're commercially sealing for Filecoin storage providers. The way our remote sealing system works, essentially, is that we built a remote bridge.
B
We had to build a custom bridge that can separate out the sealing pipeline from the actual storage provider (miner) setup, and it's a super easy deployment. Essentially, it acts just like a local worker on the storage provider side: you install our bridge, set some custom environment variables, whitelist an IP, and then you can start sending jobs to our compute pipeline.
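For illustration, a deployment of that shape might look like the sketch below. Everything here is hypothetical: the environment variable names, endpoint, and payload fields are stand-ins, not Aligned's actual interface. It only shows the "set env vars, whitelist, send jobs" pattern.

```python
import os
import requests

# Hypothetical configuration; Aligned's real bridge defines its own
# variable names and endpoints.
os.environ["SEALING_BRIDGE_URL"] = "https://bridge.example.com"  # assumed endpoint
os.environ["SEALING_BRIDGE_TOKEN"] = "<api-token>"               # assumed auth token

def submit_seal_job(sector_id: int, unsealed_cid: str) -> str:
    """Send one sealing job to the remote compute pipeline (sketch)."""
    resp = requests.post(
        f"{os.environ['SEALING_BRIDGE_URL']}/jobs",
        headers={"Authorization": f"Bearer {os.environ['SEALING_BRIDGE_TOKEN']}"},
        json={"sector_id": sector_id, "unsealed_cid": unsealed_cid},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["job_id"]
```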
B
And for very big needs: we have our data center in Canton, Ohio. Power is great over there, and it has a super fast internet connection, but there are bandwidth constraints for super big needs.
B
So what we can do is essentially co-locate our sealing clusters in your data center, and then we can have a direct line connection. That opens the door for a lot of shared revenue opportunities, plus it obviously solves the bandwidth constraints. And then, just at a high level for the Filecoin network: what we're doing is trying to accelerate the Filecoin network's growth by decoupling sealing compute from storage.
B
What
we're
doing
is
really
helping
lower
the
costs
by
allowing
different
storage
providers
or
different
service
providers
to
specialize
in
what
they're
good
at
so,
for
example,
if
you're
a
storage
provider,
don't
worry
about
building
out
the
ceiling
pipeline,
stuff
just
focus
on
storage,
but
you
know
looking
at
high
level
about
2.5
billion
dollars
of
network
Hardware
has
been
invested
so
far
with
a
large
portion
of
that
going
towards
ceiling.
B
The Filecoin network still has a lot of growth left, so a lot of additional investment will have to be made in sealing compute, and this is where separating it out will really help with the next leg of growth. Specifically, for any service provider or user of the compute, in this case storage providers who are using our sealing as a service, the benefit to them is moving the business model away from an upfront capex investment to a pay-as-you-go operating expense. You only pay for what you need, and it really eliminates the problem we're seeing out there with storage providers who have over-invested in sealing compute that is now sitting idle. So, essentially: pay as you go, and pay for what you need.
B
Moving that capex to opex and optimizing it will allow storage providers to focus on storage, invest more in storage, and help that side of the business. Also, it's complicated to build sealing compute systems, and as a storage provider, or in general, you should focus on what you're good at. So, again on the specialization point, storage providers can now focus on doing storage and building out APIs for the client-facing stuff, whereas they don't need to focus on building sealing infrastructure. Again, that really benefits the whole network and everybody who's in it.
B
Now, where are we today? Aligned is currently running a large fleet of FPGAs in Canton, Ohio. We have a data center there with a 200-gigabit connection, and we have a lot of chips sitting on the shelf ready to be deployed. So we can build out different types of compute pipelines for different use cases, and we are actively looking at and building different types of compute based on what we see out there.
B
You
know,
I
wanted
to
talk
to
you
guys
to
open
myself
up
to
communicate
as
you're
building
out
your
different
compute
use
cases
on
the
front
end
or
wherever,
however,
you're
building
it
out.
You
should
talk
to
us
because
we
could
we
should
work
together.
You
know,
if
there's
a
big
opportunity
there,
you
can
use
our
infrastructure
to
start
to
build
out
your
compute
pipelines
so
feel
free
to
reach
out
to
me.
B
You can always get in touch with me by email, or I'm on the Filecoin Slack; the info's right there. So yeah, I'd love to talk, and I'll open it up to any questions.
A
If someone wants to engage with Aligned, is there a typical timeline or process? Let's say we identify some compute that could benefit from custom hardware: could you give us a sense of what the ballparks would look like? How do we do a proof of concept, and how does that eventually get into a larger rollout? Any sort of timelines around that would be super interesting.
B
The only thing is, as you know, resource constraints are a big challenge for startups, so right now we have specific focuses. If it's a big opportunity, we would love to hear about it and work with you. As for the timeline, it could be weeks: we could actually spin up something in a week or two, or it could be something that takes a bit longer, depending on the job.
B
It really depends on each individual job, but I encourage you to reach out to me with these use cases if you have something, so we can map out what the opportunity is and whether it makes sense to build something out for it.
D
Yeah, just a quick question: I've done some FPGA work in the past. Do you guys allow any access for working with higher-level languages on top of Verilog or VHDL, like HardCaml, or something like Clash for Haskell? Is there a way to operate more with software and libraries on top, for programmers who want to run workflows in your environment?
B
A
This is super helpful, Scott. One of the other things that I see coming along a lot is, you mentioned ZK algorithms: zero-knowledge proofs. The zero-knowledge proving era holds a lot of promise for folks to fix a lot of issues around private data sets and machine learning; people talk about using ZK for distributed learning and things like that. So I'm really glad to hear what you guys are working on in that space, because anything we can do to overcome these hardware limitations and bring more of these use cases to market, I think, helps all of us in the decentralized compute space.
A
Very good. All right, thank you so much. Unless anybody has questions, I think we can hand it over to our next guest, Alexander Borzunov from Petals.ml. Can you hear us, Alexander?
E
Yeah, sure. Hi everyone, and thanks, Wes, for inviting me to give the talk today. So let me share the slides.
E
Looks good? Yes? Great. So, okay, let me start. My name is Alex, and today I'll present our new system called Petals. This is basically a decentralized platform for running large language models, and by large I mean the same size as GPT-3. By the way, if you have any questions during my talk, you can just drop them in the chat, and I'll try to answer them either now or after the talk.
E
Okay, let's begin. Just to give you a quick background: I think most of you have already tried large language models, maybe GPT-3, ChatGPT, or other chatbots. The main feature of large language models is that they can solve many language processing tasks, not only chatting with people for entertainment, but many practical language processing tasks, out of the box or with really minimal fine-tuning.
E
As an example, in the picture on the left: imagine someone creates a chatbot that accepts orders for delivering food. A few years ago, you would need an NLP engineer to parse a sentence in English, in human language, into some JSON format that you can later process in your database, in your delivery system. But now you can actually just give it a few examples.
E
It's enough to give just one example of what you want to do with the input text, basically how to convert it into this kind of JSON, and you make the model convert any later text for you. So you provide one example of a response, and it does it automatically afterwards; often you don't even need the one example. You can just describe the task, and it understands what it should do and generates, for example, a line of code, or maybe a longer text. And importantly, language models can be used not only for generating text: you can use their internal representations. You can basically cut off the head that chooses the next character and use those representations.
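As a toy illustration of that one-shot pattern (this prompt is an invented example, not the one from the slides), a single worked demonstration teaches the model the input-to-JSON mapping:

```python
# One worked example, then a new order for the model to convert.
prompt = """Convert food orders to JSON.

Order: I'd like two margherita pizzas delivered to 5 Main St.
JSON: {"items": [{"name": "margherita pizza", "qty": 2}], "address": "5 Main St"}

Order: One pad thai and a green tea to 42 Oak Ave, please.
JSON:"""

# Feed `prompt` to any text-generation interface (e.g. model.generate(...)
# in the Petals example shown later) and parse the completion as JSON.
print(prompt)
```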
E
GPT-3 is 175 billion parameters, and models keep growing; there are now closed models with up to two trillion parameters, and so on. At first, all these models were proprietary, so companies like OpenAI and maybe DeepMind had them, but there was not much access to them for people not working at those companies. Maybe we had some limited APIs.
E
We
could
try
some
limited
features,
but
we
didn't
have
full
access
for
research
for
like
implementing
different
stuff,
different
methods,
with
these
models
and
and
like
I
think
last
year
the
situation
has
changed,
because
multiple
large
models
were
released,
most
notably
Bloom
by
the
big
science
initiative
and
meta
AI
by
meta
or
x,
Facebook
and
several
other
models,
but
turns
out.
It
doesn't
really
change
situation
much
because.
E
Very few people were able to actually run these models, because they're really huge. Basically, if the model is the size of GPT-3, so something like 175 billion parameters, you need at least that number of gigabytes of GPU memory to run the model efficiently, and that's even if you use very smart compression techniques. If you just store each parameter as a floating-point number, you will need twice or maybe four times as much.
E
So
anyway,
with
even
with
the
state-of-the-art
compression
methods,
you
can
only
run
it
if
you
have
like
really
lots
of
high-end
Hardware
such
as
maybe
a
three
a100
gpus,
with
lots
of
GPU
memory,
or
maybe
eight
of
more
consumer
grade
gpus
3090.
So
it
is
actually.
This
Hardware
is
very
expensive
to
buy
quite
expensive
to
rent
and
it
is
still
difficult
for,
like
independent
researchers
or
maybe
small
labs,
small
companies,
Universal
apps,
to
use
all
of
this
so
actually
letter.
E
So let's take a look at what options we had before Petals. You can download the model, assuming you have maybe only one consumer-grade GPU or one high-end GPU, but not a GPU cluster. One method you could use: take one machine with a GPU, and when you run the model, copy model blocks from your disk or RAM to the GPU on demand.
E
But this turns out to be extremely slow, because to generate even one token, even one part of a word, you need to go through all 70 transformer blocks of this large language model. You basically load it part by part, and to generate even one word you need to transfer almost 200 gigabytes to GPU memory, and you need to transfer it again for each word. So this turns out to be very slow, even if you just copy it from RAM.
E
If
so
in
this
case,
it
consumes
like
something
like
20
seconds
per
token.
So
definitely
this
is
not
close
for
anything.
You
could
use
to
like
make
some
interactive
inference.
Make
chat.
Bots
for
any
like
practical
applications-
that's
not
quick
enough!
Unfortunately,
and
it
gets
even
slower
if
you
don't
have
200
gigabytes
of
RAM
and
you
only
use
a
floating
from
s
his
team.
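A rough back-of-envelope check of those numbers (the bandwidth figures below are illustrative assumptions, not measurements from the talk):

```python
params = 175e9        # parameters in a GPT-3-scale model
bytes_per_param = 1   # assumes int8 quantization; fp16 would double this
weights_gb = params * bytes_per_param / 1e9  # ~175 GB moved per token

pcie_gbps = 16        # assumed host-to-GPU bandwidth, GB/s (PCIe 3.0 x16)
ssd_gbps = 3          # assumed NVMe read bandwidth, GB/s

print(f"from RAM:  ~{weights_gb / pcie_gbps:.0f} s/token")  # ~11 s/token
print(f"from disk: ~{weights_gb / ssd_gbps:.0f} s/token")   # ~58 s/token
```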
E
So
okay
and
another
obvious
option
is
that
you
could,
maybe
you
don't
even
have
a
GPU,
but
you
could
use
some
hosted
apis
from
open,
AI,
maybe
other
companies,
and
they
are
very
convenient,
of
course,
but
they
may
be
expensive.
E
And you only get the output text; you can't analyze what happens in the model under the hood. So that's where Petals comes in. We suggest a new option for running any large language model if you don't have a GPU cluster yourself, and that is basically to collaborate with others over the internet, as some of my colleagues say, in the style of BitTorrent or other decentralized projects.
E
The core idea is that you load a small part of the model, then team up with people serving the other parts to run inference, or maybe adapt the model to your own tasks. To give some terms here: we have participants called servers that load the model blocks, for example BLOOM blocks or those of any other large LM. They load them onto their GPUs and set up a service for others, so that they can do forward and backward passes, basically computations, through these blocks.
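In client code, this looks roughly like the sketch below, modeled on Petals' public examples (exact class and checkpoint names may differ between Petals versions):

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM  # name may vary by version

model_name = "bigscience/bloom"  # a model served by the public swarm
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Only the small embedding layers live locally; the transformer blocks
# run on remote servers found through the swarm.
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0]))
```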
E
So we may want to send requests to the servers holding the first third of the model, then to the servers holding the second third, and so on. This may actually come as a surprise, because I've already told you that running the model locally, if you don't have enough GPUs, is very slow, and here we not only run something locally, we also communicate over the internet, and the internet is a relatively slow network compared to the local networks in high-performance clusters.
E
So you may wonder why it is faster, because the internet is faulty and possibly slow. It came as a surprise to me too that generation this way is actually at least 10 times faster than with offloading, the method I described for running the model locally. So why is it faster? It turns out:
E
This
is
just
because,
okay,
even
if
we
use
like
the
internet,
a
very
goal
and
possibly
fall
to
network,
we
do
not
send
much
data
this
way,
and
this
is
because,
in
the
offloading
case,
we
needed
to
like
constantly
send
huge
model
blocks.
But
in
our
approach,
every
peer
can
just
load
a
certain
part
of
the
model
like
to
their
gpus
and
then
just
like,
accept
small
requests
with
a
small
internal
representations
of
the
model
to
like
do
their
part
of
the
job.
E
So, to sum up: we use a slow network, but we send thousands of times less data, so this actually works faster, even though some peers may leave, and even though there can be faults related to the unstable internet, for example. Also, an important feature is that you have a lot of control over the model that you didn't have in the hosted API case, because you call parts of the model yourself, from your client that runs locally on your PC. For example, you might want to take a look at what state the model had after a certain block, or insert some new features, insert adapters, or maybe change the order of layers.
E
Somehow
you
can
do
this
with
the
as
you
do
this
like
in
the
usual
in
the
usual
machine
learning
Frameworks,
and
this
is
something
that
is
not
possible,
usually
in
the
apis.
Okay,
let
me
let
me
take
a
look
at
the
question.
Is
the
history
available
for
transformation
models
throughout
if
its
life
cycle,
not
sure
I,
understand
the
question
correctly?
History
is
not
stored
by
almost
all.
E
Let me go further, and feel free to ask more questions if I didn't respond very clearly. So, okay: generation is fast, but here is another problem. Basically, all the peers share a common model, and what if they want to adapt the model to their own tasks? Maybe teach the model a new language, or teach it new tasks; basically, do anything that implies changing the weights of the model.
E
Like
is
it
possible
and
it
turns
out
that,
yes,
basically,
even
if
we
assume
that
all
these
shared
blocks,
hostage
and
servers
are
constant
and
modern,
NLP
suggests
that
you
can
use
some
parameter.
Efficient
adapters
or
trainable
prompts
basically
a
very
small
additions
to
the
model
compared
to
the
like
this
size
of
the
pre-trained
model
to
adapt
the
model
for
like
most
real
world
tasks.
So
basically,
this
deeper
drained
model,
it
kind
of
stores
all
the
knowledge
about
the
world.
E
It
has
learned
from
the
internet
and
code
and
so
on,
and
you
can
add,
just
a
small,
very
small
adapter,
to
adapt
it
to
the
task
you
need
and
this
adapter
it
doesn't
need
much
memory
or
compute,
so
it
can
be
even
stored
locally,
even
on
CPU.
At
some
cases,
when
your
client
doesn't
have
a
gpus,
so
clients
can
store
adapters
in
prompt
locally
and
just
train
them
as
usual
neural
network,
using
like
usual
Frameworks
like
pytorch
you're
used
to
and
so
on.
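Here is a minimal sketch of that idea as plain PyTorch prompt tuning; this is a generic illustration, not Petals' exact fine-tuning API. Only the small soft-prompt parameters and a local head receive gradients, while the big shared model stays frozen:

```python
import torch
import torch.nn as nn

hidden_size, n_prompt_tokens, vocab = 1024, 16, 50000  # toy sizes

# The only trainable parameters: a few "soft prompt" vectors prepended
# to the input embeddings, plus a small local task head.
soft_prompt = nn.Parameter(torch.randn(n_prompt_tokens, hidden_size) * 0.02)
head = nn.Linear(hidden_size, vocab)
optimizer = torch.optim.Adam([soft_prompt, *head.parameters()], lr=1e-3)

def frozen_backbone(h):
    # Stand-in for the frozen transformer blocks. In Petals these run on
    # remote servers, which also perform the backward pass, so gradients
    # still reach the local soft prompt.
    return h  # identity placeholder for this sketch

embeds = torch.randn(1, 10, hidden_size)              # fake input embeddings
h = torch.cat([soft_prompt.unsqueeze(0), embeds], 1)  # prepend the prompt
logits = head(frozen_backbone(h))
loss = logits.mean()   # dummy loss, just to drive the backward pass
loss.backward()        # gradients flow into soft_prompt and head only
optimizer.step()
```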
E
These small additions can be trained and imported from each other, and of course you can inspect some intermediate states for research, which is important for ML scientists. Okay, and before I show you some examples, some technical stuff: the crucial thing is that we use libp2p for all the networking, which allows us to have protocol-agnostic networking code. So maybe some peers want to communicate over TCP, while others want to communicate over QUIC, over UDP, because that is more convenient for UDP hole punching.
E
Basically,
the
only
way
we
can
communicate
with
UDP
all
punch
links
to
use
like
quick,
not
TCP,
and
we
can
actually
like
this
allows
us
this
allows.
So
if
someone,
for
example,
some
user
decides
to
join
petals
basically
provide
the
computing
power
of
their
GPU,
maybe
their
GPU
at
home
or
at
their
company.
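Joining as a server is roughly a one-line command; the sketch below wraps it in Python for consistency with the other examples, and the exact module path and checkpoint name may differ between Petals versions:

```python
import subprocess

# Serve a share of the model's blocks from a machine with a spare GPU.
# The server announces itself via libp2p and joins the public swarm.
subprocess.run(
    ["python", "-m", "petals.cli.run_server", "bigscience/bloom"],
    check=True,
)
```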
E
Yeah, and importantly, the whole system is decentralized, so we don't have any critical point of failure. There are a couple of services I'll show you later; you can just clone the repository and set up a replica of them yourself, and there is nothing in the system that cannot be replicated by other people, except maybe the bootstrap peers, the bootstrap addresses like the ones we have in IPFS. And importantly, all steps are fault-tolerant. Of course, people may leave at any time, maybe because they want to use their GPU for something else, and the clients will always be able to find another server holding the same blocks. And a couple of words on how to make this efficient: basically, we need to compress everything so that we use as few servers and as little communication as possible. So we compress model weights with the latest techniques: we don't actually store them as floating-point numbers, but rather in a quantized state.
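As a tiny illustration of the quantization idea, here is a simplified 8-bit absmax scheme; the techniques Petals actually uses, such as 8-bit mixed-precision quantization, are more sophisticated than this:

```python
import torch

def absmax_quantize(w: torch.Tensor):
    # Store int8 weights plus one float scale: roughly 4x smaller than fp32.
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4, 4)
q, s = absmax_quantize(w)
print((w - dequantize(q, s)).abs().max())  # small reconstruction error
```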
E
Okay, and of course there is some load balancing. First of all, if some servers leave, imagine in this picture that all these servers leave and we have a gap in the network; the other servers are able to change the blocks they hold to close this gap, or maybe a bottleneck, and the same goes for clients.
E
So
clients
may
choose
servers
to
optimize
latency
for
Generation
to
optimize
throughput
for
training
with
larger
batches,
because
throughput
is
much
more
important
than
latency
in
this
case
and,
of
course
like
there
is
some
randomization
so
that
we
don't
always
fall
into
the
same
path
and
we're
kind
of
trying
to
distribute
a
lot
over
the
network
and
now
I'll
just
show
you
a
couple
of
examples.
So
right
now
our
public
swarm
hosts
two
models.
So
this
is
a
Bloom
and
Bloom
Z.
E
These
are
basically
a
kind
of
publicly
released
models
that
are
similar
to
gpt3.
Maybe
they
are
of
the
same
size
as
jupiter3
and
Bloom
is
the
rec
is
a
regular
language
model
and
Bloom
Z
is
its
version
fine-tuned
to
follow
human
instructions?
So
basically,
that's
maybe
something
closer
to
chat
GPT,
because
the
standard,
mod
language
model
May
refuse
to
follow
your
instruction,
but
this
does
its
best.
So
yeah
we
have
two
of
such
models.
E
Many
servers
contributed
by
different
users
of
and
organizations
of
different
capacity
that
allows
us
to
like
form
the
full
chain.
So
basically,
both
of
these
models
have
70
blocks
and
the
client
when
it
wants
to
generate
something
or
train
an
adaptation
of
the
model.
They
choose
a
chain
through
all
these,
through
all
the
available
servers
given
the
conditions
they
want
to
optimize
and
then
run
the
model
with
maybe
like
a
decent
speed.
So
our
speed
is
something
like
one
token
per
second.
E
It
is,
of
course
not
not
as
strong
as
charge
PT,
because
we
didn't
spend
that
much
time
on
the
modern
side,
but
still
it
runs
to
run
this
Bloom
Z
Model
the
model
fine
tune
to
follow
the
human
instructions
and
see
its
outputs
in
real
time
as
soon
as
it's
it
is
generated
by
the
Swarm.
So
it
should
work
with
one
or
two
seconds
per
token.
Sometimes
it's
slower
because
sometimes
we
have
lots
of
clients
and
of
course
we
need
more
servers
joining
to
handle
increased
Lots
properly.
E
But
anyway
this
works
and
without
battles
like
you
couldn't
you
couldn't
actually
try
balloon
Z,
because
it's
not
available
yet
in
most
inference
apis
so
and
like
inference
with
offloading
is
like
20
seconds
per
token.
So
this
allows
many
people
to
to
try
this
model
now
and
if
you
are
an
ml
engineer,
a
machine
learning
engineer,
it
is
important
that
we
made
battles
interfaces
loop
so
that
we
made
battleless
interfaces
so
that
petals
is
no
more
difficult
to
use
than
running
on
a
small
model
on
your
PC.
E
So,
basically,
all
these
interfaces
are
very
similar
to
a
popular
hug
and
face
Transformers
library
and
you're
kind
of
just
create
another
models.
And
if
you're,
just
an
ml
engineer,
you
don't
need
to
know
that,
like
under
the
hood,
there
are
like
huge
amount
of
algorithms,
all
the
lipidopy
stack
and
so
on,
like
lots
of
going
on,
and
for
you
it's
just
like
the
usual
model,
you
run
so
basically,
that's
mostly
it
some
points
about
the
future
work.
We
may
do
so.
E
We
consider
introducing
rewards
for
hosting
servers
because,
of
course
there
may
be
some
imbalance,
for
example,
if
it
turns
out
that
many
people
made
some
client
applications
for
battles,
but
not
many
people
make
decide
to
like
join
as
servers
so
yeah
we
wanna
to
introduce
some
rewards
that
will
serve
people
hosting
servers
will
then
be
able
to
use
them
on
high
priority
inference,
and
maybe
some
extra
features
like
increase
Tech
flanks.
We
do
not
consider
like
any
crypto
at
the
moment.
E
It
will
be
just
like
a
simple
reward
system
because,
like
our
system
get
all
this
ml
focused
so
but
anyway,
I
hope
this
will
motivate
some
people
to
contribute.
Also,
we
wanted
to
make
like
a
leaderboard
of
who
contributed
the
most,
so
people
can
kind
of
advertise
their
companies-
maybe,
for
example,
you
own
a
GPU
hosting,
and
you
can
like,
provide
a
couple
of
GPU
and
advertise
its
name
in
our
leaderboard.
E
If
they
cheat,
we
can
like
ban
them
and
maybe
and
make
some
subtract
some
amount
of
their
points,
and
so
on
and
I
I
would
say
that
the
most
important
downside
of
our
system
compared
to
other
approaches,
is
privacy,
because
peers
in
the
Public's
form
May
recover
parts
of
your
data.
That's,
unfortunately
how
it
works
right
now,
of
course,
there
are
like
a
lot
of
stuff
about,
like
some,
you
know:
multi-party
computations,
like
maybe
ziki
ZK
proofing,
and
that's
more
for
security
and
so
on.
E
But unfortunately, this does not work well for large-scale machine learning yet, because it usually involves a 10x or 100x slowdown, due to the fact that these methods usually work with, you know, integer, modular arithmetic, and machine learning is a completely different field: you calculate everything in floating-point numbers, and these MPC algorithms don't translate well to floating-point.
E
So
you
have
lots
of
Converse
versions
so
anyway,
like
we,
we
think
that,
unfortunately,
it's
not
practical
to
apply
any
known,
MC
methods
here
yet,
but
maybe
some
methods
will
appear
in
the
future
and
for
now
we
we
just
just
set
up
if
you
like,
if
you
still
wanted
to
use
pedals.
For
example,
you
are
in
some
Universal
app
that
wants
to
process
some
private
data.
You
can
set
up
a
private
swarm
between
organizations.
E
Organizations you trust: maybe you join with another small company or another university lab, and you can easily set up a private swarm, basically a private network, where all the data is processed among the peers you trust, so security and privacy won't be issues for you. So basically, that's it. You can check out our website; we have the GitHub repo, docs, tutorials, everything there, and also the paper, if you want to dig into the technical details. Feel free to ask any questions now.
A
Alexander, this is amazing; I'm a huge, huge fan of the work that you guys have built, in so many ways. I'll start with one question, and I have a lot more, but I'll give some other folks an opportunity to weigh in. Could you share a little bit about when you were starting the network of compute providers available in your network, the people that are actually running the Petals software?
A
Could
you
talk
a
little
bit
about
how
you
grew
that
network?
Was
it
simple
word
of
mouth
or
do
you
have
any
more
sort
of
intentional
way
of
growing
that
and
maybe
what
you
would
like
for
it
to
become
in
the
future.
A
E
Sure. So basically, it grows at some rate for now. We got some initial visibility because BLOOM is a very popular model, and indeed, running such large models was an issue for many people, so we got some visibility on the ML subreddits, maybe ML Twitter, Hacker News, and some people just came to try it out.
E
It's not difficult for some people to host a couple of servers, because they may have a couple of spare GPUs, but of course we are not growing at a very quick rate yet, and that's why we want to work on introducing incentives as fast as possible, so that people are more motivated to do that. However, I think the growth rate is not zero; I suppose it's something similar to BitTorrent.
A
Yeah, great. It's a testament to the interest right now in the language models that you guys are serving.
B
What's the current traction that you're seeing? And when you look at it, I don't know, two years out, what stops you? How do you quantify it for, you know, a team that's looking to do something that's more commercial, more scaled, and less on the science-fair-project side?
E
Yeah,
so
we're
actually
a
research
team
saw
like
petals
is
not
a
startup.
We
we're
just
like
researchers
at
a
lab,
and
so
we
decided
to
make
this
project
with,
like
other
researchers,
from
the
big
science
collaboration,
basically,
collaboration
that
unites
all
the
researchers
from
lots
of
different
universities
and
companies,
and
so
we
actually
don't
want
to
monetize
it
don't
plan
to
monetize
it
to
get
any
profit
ourselves
for
now.
E
So
our
I
think
our
primary
focus
for
now
is
to
just
make
some
kind
of
self-sustainable
system,
like
maybe
BitTorrent
or
Bitcoin.
So
maybe
maybe
you
will
move
on
to
any
of
the
other
project
or
company,
but
the
system
will
live.
Maybe
someone
else
will
be
able
to
like
contribute,
and
so
on
so
yeah.
We
don't
have
like
explicit
plans
for
monitor,
Asian
and
only
want
to
add
incentives
so
that
like
to
balance
the
demand
for
the
servers
and
Supply,
but
these
incentives
they
are
not
like
planned.
Like
you
know,
they
are
not.
B
Do you have a team, like our team, to develop a pipeline for these two offerings, you know, for the reward? Can you define, like, okay, in a year or two years we think it could be this big because of this inflection point? And I think someone wanted to ask what universities you're working with; maybe that's kind of the path. Is there a partnership or something that you need that would cause that growth?
E
Yeah, so right now we're thinking about different ways, you know, to spend, to award points, so thanks for the suggestion. I guess we need to take a look at different approaches, because there are lots of similar networks out there awarding people for computations, but for now I can't state any specific plans.

C
Very cool, thank you.
E
So, a question from Irina: yes, where am I based? I'm currently in Armenia. My university... so yeah, we are actually a research lab at Yandex, in the Yandex company, but we're kind of very independent, because we don't do anything for the business.
E
We're
mostly
focused
on
like
something
and
contributing
papers,
and
we
did
this
project
with
people
from
many
other
companies
like
on
the
first
slide,
like
people
from
also
hug
and
face
that's
a
well-known
company
in
ml
in
NLP
field,
also
from
people
from
with
people
from
University
of
Washington
and
so
on.
Yeah
yeah,
thanks
for
the
link
yeah
with
people
from
big
science.
Basically.
E
Sure. Yeah, so basically, I think our top priorities now are these. First of all, we want to finish the incentive system, because it is important for matching demand and handling the growth more naturally. Also, we want to do some technical optimizations; I'll move to this slide. Basically, there is huge room for improvement in the routing algorithms, that is, choosing the path among servers so that everything works as fast as possible, and we can also consider implementing some things like tensor parallelism.
E
So
basically,
if
you
have
one
machine
with
a
few
gpus
right
now,
you
need
to
like
run
and
battle
servers,
but
you
will
be
able
to
run
one
tensor
parallel,
tensor
pedal
server
that
will
be
like
n
times
faster.
So
basically,
this
also
will
decrease
the
inference
time,
so
so
yeah
I
think
that's,
basically
it
so
incentives
and
some
work
on
making
the
network
more
fast
and
more
stable.
C
Is there any understanding of the asymptotic limits of how well this could perform, relative to models being run all in one location?
E
Yeah, so there is something you cannot optimize: basically, you can optimize the computations that are done inside one server, but you cannot do much about the latency between servers. For example, right now, for BLOOMZ, you will need three hops at best.
E
So, if you choose one server holding, say, blocks 0 to 16, then this server, and then this server, you can optimize what happens inside a server, but the latency between them stays. I think this can also be optimized to some extent, for example by choosing the servers geographically closest to you. I imagine that if the network grows, we'll have servers in America, in Europe, maybe on other continents as well, and the clients will be able to choose servers so as to minimize this latency.
C
Also
are
there
substance,
are
there
like
what
are
the
main
differences
between
training
and
inference
in
these
models,
and
also
you
did
go
over
trading
somewhat,
but
it
would
you.
It
was
also
those
models
weren't
being
trained
from
scratch.
So
is
it
also
so
what's
the
story
with
training
models
from
scratch.
E
Yeah, so actually, our research group started by approaching how to train very large models from scratch. I think they started in 2020, and I joined soon after, and at that point there was GPT-3, but there were no publicly released models of this size. So we thought that maybe we needed to make a system for training them from scratch in order to get them.
E
But
actually,
like
turned
out
that
this
system,
like
we
made
a
library
of
this,
it's
called
hive,
mind
and
actually
battles
uses
it
a
lot
under
the
hood.
So
like
we
built
on
our
previous
work
but
have
mine
in
my
opinion,
and
it
turned
out
not
as
popular
because
to
train
large
models
from
scratch.
You
need
like
lots
of
ml
expertise
like
much
more
than
like.
You
need
to
just
like
fine
tune
it
for
something
or
run
a
chart
and
so
on,
and
it
turns
out.
E
There
is
not
too
many
people
out
there
who
don't
work
for
a
sub
company
yet
and
who
wants
to
like
create
these
large
models,
and
it
was
much
easier
for
them
to
find
some
funding
find
some
like
donations,
basically,
then,
to
set
up
this
complicated,
distributed
training
system
that
like
uses
sleep,
b2p
and
many
other
things.
So
basically,
I
have
mentioned
that
to
be
much
more
complicated,
though
you
can
still
use
it
to
to
train
like
models
from
scratch,
maybe
not
100
billion
plus,
but
I.
E
Think
like
5
or
10
billion,
definitely
works
yeah
and
so
yeah
yeah
a
lot
harder
for
open
sources.
Alexandra
writes
in
the
chat
and
yeah
and
as
for
the
difference
from
between
training
and
between
inference
and
fine
tuning,
so
like
right
now.
Basically,
an
important
difference
is
that
you
usually
do
inference
like
for
one
sequence
or
for
a
few
sequences
in
parallel,
and
you
usually
do
training
for
for
large,
very
large
batches
of
examples.
E
Latency
is
not
important
for
training,
because
throughput
is
much
more
important
because,
like
your
transfer,
lots
of
data
then
process
a
lot,
a
huge
batch,
then
transfer
them
and
so
on
and
and
yeah
and
for
instance,
another
important
consideration
is
that
it
consumes
memory
because,
like
people
for
example,
if
you're
talking
to
like
our
chat,
our
chatbot,
you
need
like
while
you're
talking
all
the
servers
in
your
inference
chain
should
store
all
the
previous
like
States
for
the
previous
tokens,
so
that
you,
the
Transformer,
can
attend
to
it.
E
That way it can look back and understand what you were talking about, and so on. So inference consumes some memory, which is called the cache in this table, and you need to allocate that memory for long periods of time to allow inference.
A
Alexander, one last question here; I know we're getting a little close on time, but I'm super interested in the checking, the validation, that your team is doing for potential cheaters in the system. Could you give a high-level perspective? When you're thinking about designs for detecting malicious behavior or cheating behavior, do you try to approach it in a programmatic way, or is it somehow more of a manual task? That'd be very helpful, because a lot of other projects are concerned with similar problems.
E
Sure. So, the security system that we want to start with is definitely not a silver bullet.
E
It won't be perfect, but I think it will be able to catch at least hardware failures, or people who didn't spend multiple weeks digging into our whole stack, and so on. What we want to do is basically make validators that pretend to be clients and send different kinds of requests to the servers. They know the correct answer in advance, because we may pre-calculate it, or we can calculate it using a subset of trusted servers, since there are a couple of servers hosted by us in this scheme.
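Here is a minimal sketch of that validator idea, with hypothetical function names; a real check would compare floating-point activations within some tolerance:

```python
import torch

def validate_server(query_server, trusted_forward, block_input, atol=1e-2):
    """Probe a server with a request whose correct output is known in advance.

    `query_server` and `trusted_forward` are placeholders for "run this
    block on the server under test" and "run it via trusted servers".
    """
    expected = trusted_forward(block_input)  # pre-calculated / trusted result
    observed = query_server(block_input)     # the suspect server's answer
    # Allow small numeric drift (different GPUs, quantization); flag the rest.
    return torch.allclose(observed, expected, atol=atol)

# Tiny self-check with stand-in functions: an honest server passes.
honest = lambda x: x * 2
print(validate_server(honest, honest, torch.randn(4)))  # True
```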
E
We can compare the results, and if you assume there is some malicious server that responds with incorrect data, maybe not all the time but with a certain probability, it will be caught sooner or later. Of course, it will be able to harm a couple of people, but it will be penalized eventually.
E
So I think that's good enough to start with, and for increased security guarantees, one thing clients can do is run inference through multiple chains, through two disjoint sets of servers, so that you can double-check your result. And of course, for 100% guarantees, you need to use a private swarm in this scheme, unfortunately. There is no good proof of work for neural networks at the moment.
A
And if not, we'll wrap up. Thank you so much for joining, both you and Svetly; this was tremendous content, much appreciated. I'm going to post this recording to the Slack channel soon, under the Compute over Data working group, but if people want to go to cod.cloud, they can get a link to the Slack channel there. And obviously, Svetly and Alexander, we'd love to have you guys join us there for any future conversations. But just to wrap up:
A
Thank
you
so
much
for
taking
time,
love,
love
what
you
guys
are
building,
and
we
appreciate
you
for
joining.
E
Yeah.
Thank
you.
Thank
you
for
inviting
me
like
protocol
Labs.
We
use
a
lot
of
your
stuff
and
I.
Think
like
it's,
we're
very
happy
that
you
invited
us.