From YouTube: Recap on Hot Chips Conference - August 24, 2020
Description
Kevin Hunter does a recap of the Hot Chips 32 conference, which was held virtually on August 16-18. He highlights presentations and talks that focus on the latest processor innovations and machine learning processing.
A
The premier local conference on what's happening in high-performance microprocessors and integrated circuits. It started in 1989 and is held once a year in August in Silicon Valley, with tutorials, talks, and posters.
They had, of course, to go virtual this year. There were 1,250 registrants, practically double last year's, so there's some hot stuff there. For the tutorials they had two sessions on Sunday; one was machine learning scale-out, which I'll talk a little bit about.
A
The first two topic areas actually corresponded with ones on Monday morning, and I didn't deem them important enough to miss research stand-up for, but the keynote was kind of interesting, and I had to look at the recorded version afterwards because I also had a conflict for that one. I'll talk a little bit about that. So Raja Koduri, senior vice president of Intel graphics, he's got a bunch of titles there, but the title of the talk was "No Transistor Left Behind."
A
Yes, I'm going to touch on the FPGAs and reconfigurable architecture. Actually, the FPGA ones were not so interesting, but the reconfigurable-architecture one was. The ML training and ML inference sessions, even though I went to them, I'm not actually going to talk directly about those today in the interest of time.
A
A considerable amount, in fact almost all of it in some sense, was related to it, in the sense that AI and machine learning are driving hardware to such an extent that everyone's kind of looking at that. So on the server side there was some talk about that, but also on the mobile, even on the mobile processing side.
A
It's like, what kind of AI features are being added into the things that are being put into your cell phones and such. But I was mostly going to concentrate on where people see the future to be and where sparsity cropped up. Very few of the major players are actually pushing sparsity; the one exception might be, excuse me, NVIDIA and their A100 architecture, but it's only fine-grained structured sparsity, two nonzero values out of every small block of four.
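Just to make the idea concrete, here is a minimal sketch of that fine-grained structured sparsity pattern. The A100 scheme is commonly described as 2:4, meaning at most two nonzero values in every group of four weights; this is an illustration with NumPy, not NVIDIA's actual kernel.

```python
# Minimal sketch (not NVIDIA's kernel): 2:4 fine-grained structured sparsity,
# i.e. at most 2 nonzero values in every contiguous group of 4 weights.
import numpy as np

def prune_2_of_4(weights: np.ndarray) -> np.ndarray:
    """Zero out the two smallest-magnitude values in each group of four."""
    w = weights.reshape(-1, 4).copy()
    # Indices of the two smallest |w| entries per group of four.
    drop = np.argsort(np.abs(w), axis=1)[:, :2]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

w = np.random.randn(2, 8)
print(prune_2_of_4(w))  # every group of 4 now contains exactly 2 zeros
```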
A
That kind of thing. Of course, there's also Cerebras, which I'll talk about a little bit later. They have sparsity to some degree, but they didn't get heavily into what level of sparsity they actually support.
A
There
are
also
three
posters
which
I
think
we're
they're
reading
when
I
get
a
pointer
to
where
I
can
put
it
up
on
the
google
drive
I'll
put
up
all
the
the
pdfs
for
all
this
stuff.
There
they're
also
recorded
talks,
but
it
requires
a
login
to
get
to
them
and
if
anyone
you
know,
is
interested
in
actually
hearing
the
talks
behind
these
things,
I'll
try
to
make
that
available
to
you
in
some
way.
A
I can just share my credentials and tell you how to navigate to it. But these three posters looked kind of interesting because they were related to sparsity, and in particular the gampu one looks like they were trying to deal with sparsity to a limited extent, but they were dealing with it serially rather than in parallel.
A
Okay, so the first tutorial area was machine learning scale-out. In their parlance, scale-up means more intense compute on the chip. So when you double the transistor count, or you try to pack more functionality on there, that's what they mean by scale-up; in trying to get to Moore's law equivalents, that's one way of doing it.
A
The
typical
problem
there
is
that,
is,
you
have
a
power
barrier
that
they
they
can
only
scale
up
by
basically
keeping
to
a
certain
size
chip
and
this
packing
more
onto
there.
You
used
to
be
able
to
up
the
clock
frequency
and
that's
just
too
prohibitive
power
wise
these
now
you'll
melt.
The
chip
scale
out
means
distributing
the
problem.
So
one
of
the
things
that's
becoming
more
and
more
prevalent
now
is
rather
than
just
having
a
single
chip.
A
You
have
a
what's
called
a
silicon
interposer
which
is
kind
of
like
a
piece
of
silicon
with
kind
of
mating
wires
on
it,
and
you
drop
little
chiplets
on
there
to
add
the
functionality
in
some
cases,
they're
all
symmetric
some
cases.
You
have
heterogeneous
chips,
so
you
can
mention
a
max
capable
mix
and
match
capability.
B
A
Yeah, no, I mean your Xilinx chips already have that right now. That technology has been around for, okay, all right, ten years. It's just a question of the degree to which they're relying upon it now, because the big problem is that as you make the chip larger, your chance of hitting a critical defect is greater. So the idea is that, okay, we box...
B
A
Yeah, that's not even a chiplet; they just went to the entire wafer, monolithically, across the whole thing. So rather than piecing together good die, they just route around the ones that have failed.
A
Multiple boards per rack slot, where they kind of slide in those drawers, and then multiple racks, and now we're talking warehouse-level compute, where the entire building theoretically could be dedicated to one problem, though in practice that's not the case. One of the other things that I've noted is that they're starting to put computation in the network switches.
A
So the communication paths between all these servers now, it's not just that they're smart; they're actually doing some forms of computation on them, either compression or some kind of processing on the way to somewhere else.
A
So most of the talks were actually case studies in scale-out. So unless someone's interested in the hardware details of how they approached it, I didn't think that would be worth covering in this talk. One of the talks that I thought was worth looking through, at least the slides of, is the fundamentals of scaling out deep learning training. They did a very good presentation of what types of operations you need when you start scaling out.
A
You
know
beyond
a
chip
or
a
board.
What
kinds
of
parallelism
do
you
can
you
do?
I
mean
the
first
thing
everyone
does.
Is
data
parallelism
where
you
have
70
units
that
do
you
know
a
vector
multiplies
in
one
parallel
shot,
but
at
some
point
you
run
out
of
capacity
on
whatever
chip
or
board
that
you
have,
and
so
then
you
start
subdividing
in
the
model,
so
you
can
either.
A
If
we
think
of
our
ideal
models,
where
you
know
they're
you,
what
you
would
do
is
take
a
piece
of
the
model
and
execute
it
all
the
way
through
on
one
ship
and
then
subdivide
that.
So,
if
you
have,
you
know,
you
know
a
thousand,
you
know
size
vector
as
a
as
a
kind
of
tried
example.
A
You
might
divide
it
up
into
pieces
of
a
hundred
and
then
execute
the
whole
model,
including
boundary
conditions
through
the
whole
thing,
and
then
we
combine
it
at
the
end,
the
other
one
is
just
pipelining,
taking
each
layer
and
operating
on
a
different
chip,
and
then
shipping
the
intermediate
results
between
shift
to
chip
and
then
all
combination
of
these
above
now
what's
interesting
about.
A
That
is
that
when
you
look
at
how
you
want
to
recombine
the
results,
there's
various
parlances
of
these
map
reduce
operations
where
you
would
sum
them
all
in
in
one
case
where,
if
you
split
up
a
matrix
multiply,
there's
he
basically
goes
into
what
types
of
these
mapreduce
operations
you
would
use
that
you
call
it
a
higher
level
both
for
the
forward
pass
and
sometimes
it's
different
for
the
backward
pass,
and
I
I
think
it
was
a
really
good
presentation
of
that.
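As a rough illustration of why the recombine operation depends on how you split things, here is a minimal NumPy sketch, my own example rather than one from the tutorial, of splitting a matrix multiply two different ways:

```python
# A minimal sketch of why the recombine step differs depending on how you
# split a matmul y = x @ W across devices.
import numpy as np

x = np.random.randn(4, 8)          # activations
W = np.random.randn(8, 6)          # weights

# Split W along its *input* dimension: each "device" computes a partial product
# and the recombine is a sum (an all-reduce in distributed parlance).
partials = [x[:, i:i+4] @ W[i:i+4, :] for i in (0, 4)]
y_reduce = sum(partials)

# Split W along its *output* dimension: each device owns whole output columns
# and the recombine is a concatenation (an all-gather).
pieces = [x @ W[:, i:i+3] for i in (0, 3)]
y_gather = np.concatenate(pieces, axis=1)

assert np.allclose(y_reduce, x @ W) and np.allclose(y_gather, x @ W)
```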
A
It
opened
my
eyes
to
what
was
possible,
so
here's
cerebrus
they
actually
had
a
talk
in
outside
of
the
tutorials,
but
it
was
basically
a
more
or
less
reiteration.
A
So
numenta
has
had
a
relationship,
obviously
with
cerebrus.
So
this
is
a
way
for
scale
engine.
That's
you
know
the
center
section
out
of
a
300
millimeter
chip.
Excuse
me
a
wafer,
so
I
mean
those
are
the
stats.
I
mean
there's
incredible:
400
000
processing
elements
they
claimed
that
they
experienced
no
delays
talking
from
one
corner
of
the
chip
of
the
wafer
to
another.
A
So
whatever
they've
done
to
design
the
thing,
they
don't
consider
communication
to
be
a
barrier
that
it
takes
you
longer
to
access
corner
to
corner
than
adjacent,
so
they've
designed
it
to
kind
of
make
to
to
balance
all
that,
but
1.2
trillion
transistors.
A
You
know
the
amount
of
memory
I
mean
these
are
the
stats
here.
The
interesting
thing
is,
as
I
mentioned
earlier,
they've
supported
some
form
of
sparsity.
They
just
didn't
get
into
it
a
lot
in
in
this
talk,
but
they
also
provide
examples
similar
to
what
I
was
showing
to
you
before
about
a
data
parallel
model
parallel,
the
two
forms
of
it.
They
actually
show
how
they
can
map
those
onto
with
their
compiler
onto
pieces
of
the
of
the
or
processors
of
the
chip.
B
Kevin, yeah, typically this kind of thing has a problem with heat dissipation. Are they relying on sparsity to reduce heat, or is that for some reason not a problem?
A
B
A
Yeah, they didn't; if they did, I didn't twig on it. But the thing is that there is a notion, when you talk about this kind of level of integration, what's called dark silicon, where they can't afford to turn on everything on any one of these chips, and they turn off pieces when they don't need them. It's possible that sparsity gives them a leg up on that, but also, if they rotate around something...
A
Maybe
they'll
use
pieces
of
each
of
processors
they
go
through,
but
it
is
definitely
a
huge
problem.
If,
if
this
entire
thing
was
dissipating
power,
it
would
melt
so
yeah
good
point.
So
take
a
take
a
look
at
this.
Can
you
see
my
cursor
up
here
by
chance?
A
Okay,
so
take
a
look
at
these
stats
and
I'm
going
to
show
you
the
only
they
would
say
about.
They
have
another
generation
coming
down
the
pike?
A
Okay,
basically,
it's
2x,
so
they're
going
to
be
going
on
the
most
advanced
tsmc
process
and
they
were
kind
of
cagey
as
to
when
this
might
actually
be
when
they're
gonna
actually
do
the
reveal
of
this.
But
what.
B
A
No, the first version, the first generation, is available on servers that can be accessed publicly, I believe.
B
Well, I guess the question is: is that being used, has it been proven? It's like, has Google ordered a hundred thousand of them, or is it always more like, hey, you can try this out, type of stuff?
A
I don't know their deployment level. I can try going through the slides and get an answer to your question. Yeah.
A
B
The question is, okay, this is such a bold move to do. I mean, wafer scale, the idea has been around for a long time, but these people make a lot of noise about it. The question is, has it proven itself commercially, such that people are... it's like, still, like, we're...
A
Maybe it's a Pollyannaish look at this thing, and I mean they could just be sucking down investment dollars, but the fact of going to a generation two indicates...
A
B
Well, it sort of reminds me of the stuff that the Human Brain Project was doing, the Karlheinz Meier project in Germany. They had these things, it was available, researchers were working with them, and you could log on and use them, but it wasn't commercially useful yet, and so there's a big difference between those. I don't think we have to spend more time on it.
A
Okay, so the other talk I found interesting was a talk from Google where they're talking about GShard. The concept of sharding is any time you take a problem, in machine learning or whatever, in some high-performance computation, and basically break it up into pieces, and then scatter, distribute those pieces across physical boundaries, say to other chips, to other boards, to other servers, and then recombine and pull the results back.
A
Basically
this
this
talk
was
describing
the
language
they
had
for
dealing
with
that
and
how
at
google
it
it.
They
have
a
flow
from
defining
a
model
in
intense
overflow
and
working
it
down
through
an
optimization
layer
down
through
their
optimizing,
compiler
and
then
deploying
outwards
requires
partial
annotation
with
pragmas
in
the
code,
to
kind
of
give
an
indication
of
how
you
want
it
to
break
apart.
But
there's
a
lot
of
that.
That's
that's
automated
and
that's
what
they're
kind
of
highlighting
I
mentioned.
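Here is a hypothetical sketch of that annotation idea; it is not Google's actual GShard API. The programmer tags a few tensors with how they should be partitioned, and the compiler is expected to propagate and automate the rest.

```python
# Hypothetical sketch (not Google's GShard API): the programmer only annotates
# a few tensors with a sharding choice; a compiler pass is then supposed to
# propagate the split/replicate decisions through the whole graph.
from dataclasses import dataclass
import numpy as np

@dataclass
class Sharded:
    """A tensor tagged with how it should be partitioned across devices."""
    data: np.ndarray
    axis: int          # which dimension to split
    num_shards: int    # how many devices to spread it over

    def shards(self):
        return np.array_split(self.data, self.num_shards, axis=self.axis)

# Annotate the input batch as split over 4 devices along the batch dimension;
# weights stay replicated. Each "device" runs the same layer on its shard.
batch = Sharded(np.random.randn(16, 32), axis=0, num_shards=4)
W = np.random.randn(32, 8)
outputs = [shard @ W for shard in batch.shards()]
result = np.concatenate(outputs, axis=0)   # gather the shard outputs back
assert np.allclose(result, batch.data @ W)
```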
A
Also,
a
tens
torrent
is
a
company
I'll
talk
more
about
them
later
and
there
have
an
express
goal
for
automatic
process,
starting
from
just
a
pi
torch
api
and
then
distributing
things
all
the
way
across
infrastructure
of
whatever
you
have
available.
A
So
I
think
there
people
are
trying
to
lower
the
barrier
to
handling
larger
problems.
One
of
the
reasons
why
is
this
was
a
slide
I
pulled
out
of
one
person's
talk
of
these.
Are
these
transformer-based
guys
that
we're
starting
to
show
an
interest
in,
and
you
can
kind
of
see
that
this
is
exponential
growth
of
this,
so
so
around
in
2021,
we're
going
to
actually
already
start
to
see
trillion,
parameter
models
and
it'll
just
turn
upwards
into.
A
I
mean
it's,
it's
it's
a
it's
a
frightening
curve
if,
in
fact,
it
doesn't
be
over
at
some
point
just
due
to
availability.
So
there's
a
lot
of
talk
about
how
in
the
world
do
we
possibly
handle
this.
This
gargantuan
growth
in
in
desired
capacity
for
these
giant
models.
D
So the y-axis here is the number of parameters?
C
A
One of the other talks, in one of the keynotes, they were talking about Moore's law and how these things are on different exponential lines, you know, capacity versus Moore's law. And one of the quotes out of there that I found amusing was that the number of people who are predicting the death of Moore's law doubles every year.
A
C
D
B
Yeah, what I find interesting about this is that, as you point out, people have been predicting the death of Moore's law for a long, long time, going back to the 80s, and it didn't happen. And why didn't it happen?
B
Well, they just got more and more clever on how to use CMOS, making smaller features on silicon chips and so on, and every time they did that, of course, they actually reduced the power consumed by individual transistors, and things like that. Here we have sort of a different issue, and the question is, say, well, let's assume people are wrong about that, that this can't keep scaling.
B
One possible answer, one possible answer, is sparsity, and I don't know if there are others. But if that's true, that says sparsity will be at the center of all this stuff going forward, that it will be the absolute requirement to continue moving in this direction. It just puts...
B
A
Yeah, well, both of the keynotes basically took different tacks on that. I'm covering one of them, and the other one is available too. So one keynote was from Intel; of course, just as a spoiler, they see a path to a 50x increase in transistor density. And Google took the other approach, and they basically... because...
A
B
That might really be, again, hinged on certain optimizations of the sparsity, not just that they can make the transistors smaller.
A
There's still a power problem, but if you can stack them indefinitely, if you can interpose cooling layers in between all of those things, then in that sense you could keep on going.
A
Well,
your
power
dissipation
is
a
function
of
surface
area,
and,
if
you
can,
you
can
basically,
instead
of
being
at
a
solid
block
of
silicon,
if
you
can
break
it
up
and
increase.
B
It
I
I
understand
that
it's
yeah
well,
I
I
interesting
okay,
so
that
was
one
three
dimensions
with
cooling
right.
What's
that.
A
They're,
basically,
there
are
ways
of
stacking
transistors
as
wires
and
stacking
them
vertically
in
place
kind
of,
like
you
know,
taking
finfet,
but
going
and
and
stacking
in
that
the.
A
It
does,
but
since
they're
extended
structures,
they
can
actually
probably
force
either
air
or
liquid
through
there
to
do
that.
The
other
thing
is
is
kind
of
the
chiplet
idea
and
they're
they're.
Basically
looking
at
I
mean
it
only
gets
you
to
50x.
I
mean,
if
you
think
about
that,
that's
that
that's
that's
only
you
know
five
doublings
or
or
so
so
it's
it
it.
It
kind
of
runs
out
of
steam.
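As a quick check on that "five doublings" figure, my own arithmetic rather than the keynote's:

```python
import math
print(math.log2(50))  # ~= 5.64, so a 50x density gain is about five to six doublings
```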
A
At
some
point,
google's
approach,
on
the
other
hand,
was
that
what
happens
when
you
run
out
of
out
of
room
at
the
bottom
to
kind
of
use,
fame,
feynman's
kind
of
quote
and
they're
looking
and
see
how
they
can
go
in
the
other
direction.
You
know
to
to
to
vast
arrays
of
things
and
then
being
smart
about
that
yeah.
B
But
that's
still,
that's
still
your
your
accumulative
power
is
it's
still
going
to
go
up.
You
know.
A
It
it
is
it's
just
that
it's
it
is,
the
power
will
go
up,
but
it
it's
it's.
Not
it's
not
going
to
be
a
nuclear
core.
B
D
So, Jeff, yeah, I have another answer to your question. You mentioned sparsity as one way that'll help give... maybe.
B
B
D
The reason is that if you have the Thousand Brains Theory kind of idea, where each cortical column builds a model of its world, builds a predictive model of the world, I think that can be much, much more compact than a system that's simply storing every possible combination of things. Yeah, absolutely right. In fact, that's going to give... I don't know how many orders of magnitude, yeah.
B
That's
that's
that
we
we
might
be
right,
but
that
might
be
you're
right
and
I
I
guess
I
was
thinking
like.
Oh
these
people
are
building
these
traditional,
deep
learning
models.
You
know
we're
going
to
take.
You
know
these
language
models
and
just
make
them
bigger
and
bigger
and
bigger
type
of
things
and
but
you're
right.
That's
a
different
breaking
point.
That's
like
yeah,
throw
out
all
those
models
and
build
them.
This.
D
B
And something... go ahead. Given those models, how are they going to scale that, that you could scale for some reason, or sparsely. But you're right, ultimately you might have to abandon these kinds of models. And you're right, of course, the brain is a super monstrous model and it doesn't take much power at all.
A
We've talked to Rain Neuromorphics, which has that very wide, sparse thing. There are also, and I'm not going to cover them, photonic solutions that manage to do some amazing things, one of which is that if you basically have coherent light hitting something like a diffusion screen, a diffusive medium, that basically is kind of your random sampler right out there.
B
Yeah
yeah
yeah,
so
well.
Let
me
actually
see
how
this
plays
out.
I
mean,
to
one
hand
the
extent
that
people
can
continue
doing
this
stuff
just
by
doing
clever
pieces
of
hardware
or
the
optics,
or
you
know
that
that's
not
great
for
us
the
extent
that
they
really
hit
some
that
the
way
they're
going
to
keep
going
is
based
on
sparsity
and
a
thousand
brain
theory,
and
so
that's
great
for
us.
B
A
So there's a natural limit. I mean, we hit 300-millimeter-diameter wafers how many years ago, and they haven't pushed beyond that point, because to build a fab for that is on the order of like three billion dollars. Yeah.
B
A
I know, I understand that. So the next ones are more gee-whiz slides, so let me just go on to there. So this was the first keynote; he lays out the scope of the exponential challenges: the computation needs, the power, the fact that we're generating a huge amount of data, and whether we want to recycle that back into machine learning is an interesting thing. So he lays out this roadmap; I don't show the roadmap here.
A
You
have
to
look
kind
of
look
at
the
slides
because
it's
it's
it's
probably
about
you-
know
12
15,
slides
to
show
that
road
map,
so
the
no
transistor
left
behind
was
actually
quote
from
david
blyth,
who
also
spoke
at
the
conference.
So
he
just
wanted
to
give.
You
know
credit
to
where
that
came
from.
A
That's
the
idea
that
you
you
try
to
make
each
transistor
do
something
useful
all
the
time
if
possible,
you
know
so
that
you,
you
don't
waste
resources,
so
he
went
through
and
there's
a
series
of
slides,
but
this
was
the
culmination
of
basically
various
levels
of
technology
nodes
where
you
digitize
everything.
You
network
everything.
Everything
gets
onto
mobile
everything's
in
the
cloud
and
then
you
get
to
exit
scale
where
you
have
100
billion
intelligent
connected
devices.
You
know
and
the
amount
of
compute
that's
associated
with
that.
So
that's
that's
a
g
with
slide.
A
B
I think the intelligence there is not referring to the kind of intelligence we talk about with each other, it's not... or is he talking about AI? I mean true AI, or is he just saying, well...
A
They're
trying
to
push
smart
stuff
to
the
edge,
which
I
think
that's
what
he's
kind
of
talking
about,
but
I
mean
he
has
some
slides
where
on
on
a
a
graph
where
he
showed
you
know
where
we're
not
anywhere
close
to
reaching,
and
it
was
logarithmic
in
both
dimensions
where
it
was
human
intelligence
and
super
intelligence
he's
just
going
on.
You
know
synapse
numbers
and
stuff
like
that,
but
I
I
figured
that
was.
A
...that was not a really well-qualified slide. But here's the chart: if you didn't know what Moore's law was, the idea was that basically every two years you can pack that much more compute power onto the same area on a chip, and that's been driving technology, at least in the digital realm, for quite a while. But now with these models, rather than a two-year doubling, we've gone to a 3.4-month doubling of required capacity, and also the data is exploding.
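To put those two doubling rates side by side, here is a quick back-of-the-envelope calculation, my own arithmetic rather than a figure from the slides:

```python
# Compare Moore's-law-style doubling every 2 years with the reported
# 3.4-month doubling in compute demanded by large ML models.
moore_growth_per_year = 2 ** (12 / 24)    # ~1.41x per year
ml_growth_per_year    = 2 ** (12 / 3.4)   # ~11.6x per year

print(f"Moore's law:  {moore_growth_per_year:.2f}x per year")
print(f"ML compute:   {ml_growth_per_year:.1f}x per year")
print(f"Over 5 years: {moore_growth_per_year**5:.1f}x vs {ml_growth_per_year**5:.0f}x")
```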
A
The data that we're generating, and that we kind of go out in the world to get to train these things, is getting up to, by 2025, something like 175 zettabytes; that's 10 to the 21st bytes, if I'm right. Anyway, a big number. But his claim was there's still plenty of room at the bottom, and that was the 50x. So, a couple of papers I just want to go through quickly. This one was kind of interesting for a couple of reasons.
A
So basically the claim is that Bayesian inference is not easy to do on conventional processors, so they built their own.
A
The idea is that if you're trying to do something predictive using Bayes' rule and Bayesian inference, rather than having a point estimate, you're estimating a distribution, and in cases where you have incomplete data, this is, given the available information you have, your best guess at doing stuff. So they claim to be the first silicon accelerator for Bayesian inference.
A
They
did
a
hardware
algorithm
co-design
with
parallel.
Is
it
something
something
on
to
carla
always
markov
chain,
and
so
they
applied
it
to
some
unsupervised
tasks,
and
so
this
is
what
it
is
now.
Here's
the
thing
that
I
find
interesting:
they
use
chip
kit
they
actually
they
have
access
to
this
technology
where
they
have
an
incredibly
short
design
cycle
for
something
this
is
from
rtl
to
tape
out
in
three
months
by
five
people.
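For context, here is a minimal sketch of the kind of computation a Markov chain Monte Carlo accelerator speeds up: drawing samples from a posterior distribution rather than producing a single point estimate. The toy posterior and the sampler below are my own illustration, not the paper's design.

```python
# Minimal Metropolis sampler over an assumed toy posterior (standard normal),
# just to make the inner loop of MCMC concrete.
import math, random

def log_post(theta):
    return -0.5 * theta * theta   # log of an (unnormalized) standard normal

def metropolis(n_samples, step=0.5):
    theta, samples = 0.0, []
    for _ in range(n_samples):
        proposal = theta + random.gauss(0.0, step)
        log_alpha = log_post(proposal) - log_post(theta)
        # Accept with probability min(1, p(proposal)/p(theta)).
        if log_alpha >= 0 or random.random() < math.exp(log_alpha):
            theta = proposal
        samples.append(theta)
    return samples

draws = metropolis(10_000)
print(sum(draws) / len(draws))    # posterior mean estimate, should be near 0
```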
A
I mean, that's incredible for something of this magnitude, and if that technology is really available, and I think it actually is, it's going to eat into the FPGA market. I mean, look at how long it sometimes takes to bring up an FPGA with all the algorithms you want, and part of that is you're trying to work within the constraints of the FPGA chip.
A
But
if
you
just
use
standard
library
cells
and
do
place
and
route
put
things
exactly
where
you
want
them
in
exactly
the
right
type,
you
have
a
more
optimal
design.
I
mean
it
will
not
be
as
as
it
won't
be
necessarily
reprogrammable,
but
at
least
it
allows
you
to
try
things
out
with
relatively
low
overhead.
A
You
know
in
terms
of
resources,
so
this
is
just
a
quick
picture
of
where
they
started
with
the
input
where
they've
knocked
out
some
data
or
made
it
credited
up
and
then,
after
some
number
of
iterations
here's
how
much
it
tries
it
out
this
one.
You
know
you
know
it
didn't,
have
much
to
really
pull
this
together.
So
I
don't
think
that
against
the
baseline,
it
was
doing
you
know
a
hell
of
a
good
job
for
some
of
these
other
ones.
A
Okay,
tens
torrent,
so
they
had
a
talk,
neurons
versus
nand
gates
versus
networks
to
find
the
right,
compute
substrate,
artificial
intelligence.
A
They
have,
I
think,
a
complementary
or
not
complimentary,
but
a
similar
philosophy
to
to
the
manta
in
some
ways
except
they're
way,
more
hardware
oriented,
so
they
were
found
in
2016.
A
They
got
70
people,
they
have
people
with
a
spectrum
of
backgrounds
in
in
architectures,
their
ml
inference
and
training,
training,
they're
looking
at
anything
from
edge
to
data
center,
and
they
want
general
purpose
high,
throughput,
parallel
computation,
so
here's
their
particular
talk
of
where
we
need
to
be
where
ml
compute
demand
is-
and
this
is
this-
is
the
the
moore's
law
for
clusters
and
then
for
mega
clusters,
which
would
be
like
warehouse
level
and
where
we
have
to
go
and
where
moore's
law
is
going
to
take
us.
A
So
you
know
stating
that
there's
a
there's,
a
problem
in
this
match,
so
their
goal,
I
mean
ambitious,
is
largest
clusters
ever
so
they
want
networking
computing
to
chip
and
they
basically
want
one
device
in
pipe.
In
other
words,
they
want
to
take
pie,
torch,
define
your
model
in
pie,
torch
and
then
have
it
scale
out
seamlessly
across
this
level
of
hierarchy.
D
A
These are, in some ways, like a Cerebras in the sense that each of these is an individual processor. You have memory on the side, you have I/O coming in here, and this is, I guess, a connection network. Well, that's...
A
Yeah, well, so here's how they kind of map things out, where they basically take these guys into groups to define various stages. So if you remember the model...
A
Yes, I mean, they're thinking ahead to how you would scale it out, and so they basically say, how do we map pipeline parallelism and model parallelism onto these array processors, similar in that sense to Cerebras.
A
So Jawbridge was the first prototype, Grayskull is their current one, and Wormhole is the next one that they're going for. The noted fact here is this one's got 16 ports of 100-gig Ethernet; this is designed to go into a network cluster with an integrated network switch, so they're basically proving their technology one step at a time. Why did you say this is...
A
B
A
Okay, that's interesting. So this dynamic execution is where they can actually have this control flow; here they do sparse compute; here they have dynamic precision, where you use the precision you need; and there's runtime compression of weights and activations, which is very similar to what we're thinking about.
A
B
A
So they're showing the matrix multiplication, showing the results coming out of it as sparse. But what was weird, and I went on chat with these guys because you can do that on these things, is that they had the activations sparse, the weights were dense, and the results were sparse. Well, okay, but they were claiming a 100x max boost; they weren't taking advantage of weight sparsity.
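Here is a rough sketch of where a boost like that can come from when only the activations are sparse and the weights stay dense. This is my own NumPy illustration, not Tenstorrent's implementation, and it assumes 99% of the activations are zero.

```python
# Rows of the weight matrix whose corresponding activation is zero can simply
# be skipped, so the work scales with the number of nonzero activations.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
x[rng.random(1024) < 0.99] = 0.0          # assume ~99% of activations are zero
W = rng.standard_normal((1024, 256))      # dense weight matrix

nz = np.nonzero(x)[0]                     # only ~1% of rows participate
y_sparse = x[nz] @ W[nz, :]               # work proportional to the nonzeros
assert np.allclose(y_sparse, x @ W)
print(f"multiplies needed: {len(nz) * W.shape[1]} vs dense {x.size * W.shape[1]}")
```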
A
Somehow he had this notion that if you had a cascaded set of these sparse results, somehow you'd want to multiply them together, and I asked what the use case for that was, and I didn't really get a good answer. David Kanter, from Google, agreed with me; he said, I thought he just meant this to be sparse as well, so sparse times sparse gets you a new sparse result.
A
B
A
You don't believe that? Well, he'd say only in the use case of transformers, where you're multiplying activations together, as opposed to weights.
D
B
D
They're not, no. So here they can only do dense weight matrices, and the bulk of the compute is in multiplying against the weights, so you would not get the 100x there. But part...
B
B
A
Yeah, actually I've only got three more slides after this. So they claim they have full PyTorch integration; they're saying which flows they can take in. They basically use ONNX as the lingua franca in order to feed the other flows into their stack.
A
So basically, the notion is, when they say "a device," I'm assuming they're talking about that in PyTorch language, and it says basically, we can map it out no matter what the size of the computer is, but I'm presuming they mean their computer; and it says pre-trained models can benefit from the Tenstorrent features. So that's the automatic deployment flow concept.
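For reference, here is a minimal sketch of the standard PyTorch-to-ONNX handoff that this kind of flow starts from; the Tenstorrent-side import isn't public in these slides, so only the generic export step is shown.

```python
# Export a small PyTorch model to ONNX; the resulting file could then be fed
# to a vendor toolchain that consumes ONNX as its interchange format.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

dummy_input = torch.randn(1, 128)   # example input fixes the tensor shapes
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)
```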
D
It's nice to see them be native PyTorch, yeah, completely, because most of this stuff starts with TensorFlow, and then PyTorch is an afterthought.
A
Well,
I
mean
that's,
that's
the
feat
of
the
model,
but
the,
but
they
also
take
these
other
flows
as
well.
So
yeah.
D
But
onyx
is
very
restrictive,
so
if
we
anything
that
has
to
go
through
onyx,
it's
hard
to
do
something
really
novel
there:
okay,
whereas
if
it's
just
if
you
look
at
just
the
left,
pi
torch
arrow,
you
know
with
full
support
for
conditionals
and
torch
script
and
so
on.
That's
that's
pretty
good.
A
Okay,
so
here's
what
they
claim
to
have
the
capability
on.
They
have
models
ready
for
these
guys.
Here's
where
they're
they're
imagining
this
stuff
is
gonna,
be
applicable.
A
They
used
some
examples
I
think
in
their
in
the
other,
slides
for
vg
vgg
vgg,
yes,
resnet50,
there's
others
looking
at
you
know
deployment
in
those
areas,
but
that
of
course
requires
pretty
much.
You
know
other
engagements
I
would
guess
so
they
say
their
public
data
is
on
our
development
cloud
on
november
1st,
and
so
they
currently
are
doing
evaluations
of
their
of
their
product.
A
So
65
watts
is
what
their
grade
skull
runs
at
and
here's
their
bur
inference
performance.
A
Now
they
basically
are
looking
at
these
notion
of
conditionals,
with
light
conditional
execution,
mixed
precision,
moderate
conditional
execution
that
that
that
dynamic
computation
they
mentioned
over
there
where
they
can,
I
guess
they
can
switch
between
various
blocks.
I
think
that's
where
they're
talking
about
the
conditional
execution
and
when
they
can
do
that,
apparently
they
you
know,
boost
their
performance
by
a
considerable
amount,
but
even
from
the
slides
that
I
saw
they
they
didn't
really.
It
has
to
be
in
the
talk,
but
from
the
slides.
A
They
didn't
mention
a
lot
about
what
they're
meaning
about
conditional
execution.
But
it's
something
to
look
at,
because
that
crops
up
a
lot
places
you
have
predicated
execution
or
conditional
execution
where
the
cost
of
of
taking
a
branch
or
going
one
way
versus
another
can
be
expensive
in
a
lot
of
architectures.
A
Yeah, that would probably be my guess, yeah. I was looking at that and saying, okay, so it's like I said. Obviously I'm just pulling selected things from their slide deck, and these are just a fraction of the slides, but just to kind of drive the story forward.
A
So
anyway,
I
will
post
this
up.
Someplace
and
super
typically
give
me
some
place
where
I
can
upload
the
all
the
pdfs
I
can.
I
can
do
that
as
well.
Okay,
people
can
look
at
it.