From YouTube: Python on GPUs JAX
Description
Part of Data Day 2022, October 26-27, 2022.
Please see https://www.nersc.gov/users/training/data-day/data-day-2022/ for the training agenda and presentation slides.
So, my name is Nestor. I'm currently a postdoc at NERSC, and I'm working on porting the TOAST software framework to Perlmutter, and in particular to GPU, which led me to think about how we port Python code to GPU efficiently. The thing I'm going to show you today is JAX. To give you a slight teaser right now: we ported a bunch of kernels and we got things like a times-16 speedup fairly easily, which is nice. But first, how do you port Python code to GPU?
What are the various approaches that are available? The easiest one is to use off-the-shelf kernels. Maybe you have some NumPy or SciPy code: use CuPy. If you're using pandas and scikit-learn, you can use RAPIDS, and that's going to be super easy to use because basically it's plug and play. It's perfect if there is a kernel that already does the thing you want to solve in your system: you can use an existing kernel and your problem is solved, so that's very good.
The problem comes when you have something more specific in mind, when you want to write something more tailored to your particular scientific problem. Then you can try to combine those: you can try to use CuPy as you would use NumPy to build your application, your particular algorithm. But then you are allocating a lot more intermediate values, you have more data transfers to the GPU, and the performance starts to degrade very quickly.
Another idea is to use a deep-learning library like PyTorch or TensorFlow, which works great if you're doing deep learning, obviously. They are very easy to use, well documented, and there are thousands of users, so that's very nice, and clearly they are doing things on GPU in Python, so something is working for them. They have the most useful numerical building blocks: fast Fourier transform, linear algebra, random number generators, all the things we want in numerical applications. But then you try to use them to actually do GPU computing.
What you realize is that they tend to have a very large overhead, because they're optimized for a different use case. Most of the time, the thing that's expensive for them is going to be the gradient computation during training, so they're going to make that as cheap as they can. For example, if you're using PyTorch, even if you tell PyTorch that you don't care about the gradients, it's still going to be more expensive than if you were able to cut out the parts of the code that are used for the gradients.
So there is some overhead coming from that, which is non-trivial to eliminate. Maybe you know what you're doing with a GPU and you're thinking: okay, let's write the kernel in CUDA, OpenCL, or SYCL, something like that. It's going to be very low level but also very fast, and then you link it into Python using something like PyOpenCL or PyCUDA, and that gives you great performance, if you know what you're doing.
But it's going to make it much harder to use those numerical building blocks, things like random numbers, fast Fourier transforms, linear algebra. There are linear algebra CUDA libraries, for example, but if you need to use a linear solver inside a loop inside a loop inside a loop in your kernel, then you're in for a world of pain. It also requires a lot of expertise in high-performance computing, because while these languages let you write code that is very performant, actually reaching that performance takes a lot of time and experience.
If you don't know what shared memory is on a GPU, same thing: you need to learn a lot of things before you start achieving interesting performance. It's also harder to get the code correct, because you're working at a lower level. And once you've done all of that, you still need to manage to compile your code with something other than Python and link it into your Python.
Staying in Python, you could use Numba. It has a low-level, CUDA-like syntax, and I'm saying low level because all of the CUDA constructs, shared memory and the rest, are going to be there. The nice thing is you're going to be in Python, so you don't have to care about compiling and linking, that is taken care of for you, but you're still very low level, and it's still very hard to call all of those numerical building blocks.
So my question was: is there a way to have good GPU performance, portability, and productivity in Python? Is there a solution? And the thing I found that worked really nicely for our particular use case is JAX. So what is JAX? JAX is a Python library that lets you write code in Python and then run it on whatever hardware you have, which could be your CPU, a GPU from NVIDIA or AMD, a TPU if you are at Google, or something else.
It also works on specialized hardware for deep learning. At first it was developed by Google as a building block for their deep-learning frameworks, as something they would be able to use to write Python and run it on GPU, but it is seeing wide use in numerical applications, things like molecular dynamics or computational fluid dynamics.
So what does it look like? It looks a lot like NumPy. Here we have a code sample, and you see we are importing something called numpy from jax. We are generating some random numbers (we will come back to that later), and then we are calling dot on the array and array.T, and people used to NumPy are going to recognize that this is exactly what they would write in NumPy, except this is going to run on GPU, which is very nice.
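A minimal sketch of that kind of code (the exact variable names from the slide are not reproduced here, so the names below are illustrative):

```python
# Sketch of the NumPy-like JAX code described above (names illustrative).
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)               # JAX uses explicit PRNG keys
x = jax.random.normal(key, (1000, 1000))  # random matrix
y = jnp.dot(x, x.T)                       # the same call you would write in NumPy,
print(y.shape)                            # but it runs on GPU/TPU if available
```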
What's happening under the hood is that you have a just-in-time compiler. Whenever you call a JAX function, it's going to be traced: JAX is going to look at the shape of your inputs, then the shape of the output of each computation. Say you have a sum: it looks at the shape of the input and the shape of the output of the sum, and it builds the computational graph like that. Once there is a computational graph, it passes it to the XLA compiler, which is going to actually compile it for your current hardware.
So what you have is a just-in-time compiler: compilation happens at runtime, which has a price. In C++, if your code takes 15 minutes to compile, that's fine. In JAX, if it took 15 minutes to compile, your code would be 15 minutes slower. In practice you're below one second, unless you have a problem somewhere. It also means the input sizes must be known to the JIT, and not only the input sizes, but the sizes of everything inside the computation.
So you cannot have the size of an intermediate result be a function of the data, which sounded very restrictive when we started, and then we realized that with padding, masking, and compiling for bucketed sizes, you can work around that. Also, the fact that you cannot have control flow that depends on the data, at least for most things, means that loops and tests are restricted, and that also felt very restrictive when we started, until we realized JAX provides a bunch of functions you can work with.
JAX also doesn't have side effects or in-place modifications, and it focuses on TPU optimization, meaning the JAX compiler is really good on TPUs and very nice on GPU, but on CPU we found we get basically single-core C++ performance, which is better than Python but not acceptable if you're running on Perlmutter's CPUs. So it depends on your use case. Given all of that, how can we actually use JAX, because that's a long list of limitations?
And is it worth it to actually use it? So I told you JAX looks a lot like NumPy: here you have JAX code that mirrors the NumPy equivalent, and that's really nice, because if you already know NumPy you're 90% of the way there. And if you don't know how to write something in JAX, very often you can just search on Google for how to do it in NumPy, and the answer is going to be very close to the JAX one, which is very useful because it jump-starts you into using the library very, very quickly.
Now let's look at where it diverges. I told you you're not allowed to mutate things. So if you have an array and you really want to update it, say add one to all of its values, you have to create a new array, and you can name it like the previous one. That way it's functionally identical to mutation, at least inside your function.
If you want to modify individual indices, JAX provides some functions to deal with that. There is an add function: you can modify the array at an index, or at a number of indices, and this is going to produce a new array; there is no actual mutability happening. If you want to update something, all the usual increment operations are there. And something that's interesting is that JAX makes parallelism transparent. If you were to write a CUDA kernel and you wanted to modify a bunch of indices, some of those indices could be identical.
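As a small sketch of this functional update API (the array values here are illustrative):

```python
import jax.numpy as jnp

x = jnp.zeros(5)
y = x.at[0].set(1.0)                      # "write" at index 0: returns a NEW array
z = y.at[jnp.array([1, 1, 2])].add(1.0)   # duplicate indices are accumulated safely
print(z)                                  # [1. 2. 1. 0. 0.]
```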
In a CUDA kernel, you would need to think: oh, do I need synchronization here, maybe I should use an atomic add. In JAX, parallelism is transparent and JAX is going to take care of that for you: this operation is going to behave atomically, and if the compiler decides these indices might cause problems, it's going to put safety measures on top of it for you. You don't have to think about it at any point.
Also, I told you that JAX code is compiled, and that's important, because if you run JAX just like that, interpreted, it's going to be very slow; you need to compile it for it to be fast. To compile it, you have a function called jit. So here we have a demo function called f, and to get a compiled version of that we call jit(f); we get a jitted function, and whenever we call it on something, it's going to trace the function by running it on an abstract value.
That abstract value is basically a shape that doesn't contain any data, so tracing is going to trigger our print statement. Once the function is traced, it's going to be sent to the compiler, the compiled version is going to be created and run, and the next time we call our function on inputs with the same size, JAX is going to detect: oh, we have already compiled for that given size, so let's reuse it. If you pass new inputs with different sizes to that function, it is going to be recompiled.
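A small sketch of that tracing-and-caching behavior (the function f is a made-up example, and the counter is only there to make the number of traces observable):

```python
import jax
import jax.numpy as jnp

trace_count = []                 # Python side effects only happen while tracing

def f(x):
    trace_count.append(1)        # runs during tracing, not on cached calls
    return jnp.sum(x * x)

f_jit = jax.jit(f)
f_jit(jnp.ones(3))               # traces and compiles for shape (3,)
f_jit(jnp.zeros(3))              # same shape: reuses the compiled version
f_jit(jnp.ones(4))               # new shape: traced and compiled again
print(len(trace_count))          # 2
```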
So that's something you have to be aware of, but on the plus side, the compiler, knowing the sizes, can do some very clever things. It can detect: oh, that loop is going to be tiny, so we're not going to bother parallelizing that one, let's parallelize this other one. And when you pass other inputs it can say: oh, in that case this is the loop, the outer loop, where we should be focusing. So it lets the compiler be very clever, which is nice.
Also, I told you you cannot depend on values inside a jitted function. One exception to that is static values. You can tell the compiler: this value is always going to be the same, use it when you're tracing. So here, for example, you pass a Boolean and we declare that it is going to be static, and so you can test whatever you want on that Boolean; things work, and that test is going to be evaluated at tracing time, which also helps the optimizer take care of things, to simplify things.
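A sketch of a static argument (the function and argument names are illustrative):

```python
import jax
import jax.numpy as jnp
from functools import partial

@partial(jax.jit, static_argnums=1)  # argument 1 is declared static
def scale(x, double):
    if double:                       # a plain Python test, resolved at trace time
        return x * 2.0
    return x

print(scale(jnp.ones(3), True))      # [2. 2. 2.]
```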
You can also do something which is called donating inputs. That's not often useful, but when it is useful, it's going to be a performance benefit, sometimes a significant one. By default, a jitted function takes an input, returns an output, and that output is going to be a new array. But if you know your input is never going to be reused, you can tell the compiler: I don't care about my inputs, feel free to reuse them, to recycle them.
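A sketch of input donation via jit's donate_argnums option (the function is illustrative; note that donation is honored on GPU/TPU and may be ignored with a warning on CPU):

```python
import jax
import jax.numpy as jnp

# donate_argnums tells the compiler it may recycle the input's buffer.
step = jax.jit(lambda x: x + 1.0, donate_argnums=0)

x = jnp.ones(1000)
y = step(x)            # x's buffer may be reused to store y
print(float(y[0]))     # 2.0
```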
Then the compiler is going to be able to say: oh, this should be an in-place operation. That's important, because here you saw we created a new array for every update, and some people might be thinking: oh, that's going to be terrible, we're using a lot of memory for nothing. If you do that inside the compiled section, the compiler is going to be able to say: oh, we can do an in-place modification here; or sometimes: it's going to be useful to have both the previous version and the new version.
Given what we're doing, let's keep both versions. So the compiler has the leeway to be more clever than we are, which is nice, and we can donate inputs. To deal with tests, to actually test on the value of things, you have two main ways. You can call where, as in NumPy, where you give it a mask, a Boolean, something to return in the true case and something to return in the false case, and that's useful when the computation of both branches is fairly inexpensive, which on GPU most computations are.
If they are actually expensive, there is a lax.cond function: you pass it a Boolean, the function to run in the true case, the function to run in the false case, and the inputs to give to either of those functions, and that deals with tests. There is also a way to deal with loops, same thing: you have while-loop and for-loop primitives. And, more importantly, there are some vectorization primitives.
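A sketch of the two approaches to data-dependent tests (the values are illustrative):

```python
import jax.numpy as jnp
from jax import lax

x = jnp.array([-2.0, 3.0])

# Cheap branches: evaluate both sides, select element-wise.
y = jnp.where(x > 0, x, -x)              # absolute value: [2. 3.]

# Expensive branches: lax.cond only runs the selected function.
z = lax.cond(jnp.all(x > 0),
             lambda v: v * 10.0,         # true-case function
             lambda v: v,                # false-case function
             x)                          # operand passed to either branch
print(y, z)
```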
To give you an example: here we have a double loop on i and j, and in res[i, j] of the result we apply the body function to a slice of x and a slice of y. If we call xmap, which is something that's going to vectorize our kernel, that body function, we can tell it: on the first input, here is how we are going to slice it.
We slice the second input like this, and the output is going to be organized like that. xmap is going to take our body function and turn it into a function that can process the full block of indices at once. That's a pattern you find surprisingly often in code, and the more you use it, the more you see it everywhere. The GPU really loves that, because basically you're able to run your loop as a single block on the GPU, and that's very performant. Here I'm using xmap.
There is also vmap, which is a less powerful version of xmap that works on a single index. And another thing is pmap: if you have your own node with, say, four GPUs, you can use pmap and it's going to run in parallel on your four GPUs. So that's something that you can do in JAX.
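A sketch of replacing an explicit double loop with vmap (the body function and the inputs are illustrative, not the kernel from the talk):

```python
import jax
import jax.numpy as jnp

def body(xi, yj):
    return jnp.dot(xi, yj)     # scalar result for one (i, j) pair

x = jnp.ones((3, 4))
y = jnp.ones((5, 4))

# res[i, j] = body(x[i], y[j]) with no Python loop: two nested vmaps.
res = jax.vmap(lambda xi: jax.vmap(lambda yj: body(xi, yj))(y))(x)
print(res.shape)               # (3, 5)
```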
Then there is automatic differentiation, which is where JAX differs from most of the deep-learning frameworks, where, as I told you, a lot of the overhead comes just from all the infrastructure to compute gradients. JAX does gradient computation by code transformation, meaning there is no overhead for code that does not care about the gradient, because nothing is there to deal with the gradients. When you want the gradient of a function, you call grad on that function, and it's going to transform your code and produce something that's about as fast as an analytic solution.
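A minimal sketch of grad (the function is a made-up example):

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(x ** 2)     # f(x) = sum of x_i^2, so grad f(x) = 2x

grad_f = jax.grad(f)           # a code transformation: no cost if never called
print(grad_f(jnp.array([1.0, 2.0, 3.0])))   # [2. 4. 6.]
```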
So you get very, very fast gradient computation, and when you don't care about gradients there is no overhead: it's a zero-cost abstraction, which is very nice. Obviously, some functions cannot be differentiated, because some things just cannot be done. And to end the how-to-use-JAX section, there are some very simple performance tricks that are worth thinking about.
If we are not mutating data, we can have compilers that are much more clever, and the idea behind the JAX compiler is to say: okay, let's do that, let's make our life a bit harder so that the compiler can be more clever. Very often you can add maybe two lines inside the compiled section, and suddenly the compiler realizes it can reuse something you don't care about.
That's going to make your code five or ten percent faster, which is always worth getting. And as always when you're dealing with GPU computing, keeping the data on the GPU as long as you can is very worthwhile; it brings a lot of benefits. So here are some libraries that are worth looking into, and you will find a lot more in the awesome-jax GitHub repository.
There is an mpi4jax library that introduces MPI primitives as JAX primitives that you can use in a jitted section, and this is going to use your CUDA-aware MPI if it is there, so that's very nice. There is also something to help you test: chex, which checks code and makes sure it runs similarly on CPU and GPU, in a jitted section or not.
What we did is we took TOAST, which is a fairly large application, about 200,000 lines of Python and C++ code, that's used to study the cosmic microwave background. It is made of several pipelines that are distributed with MPI; it's able to use the Perlmutter supercomputer at full capacity, and all of those pipelines are composed of C++ kernels that are parallelized. This code contains pretty much everything you can think of: there are random number generators.
There is some fast Fourier transform going on, linear algebra obviously, sparse matrices: all of those things are somewhere in there. What we did is we took two of those pipelines and ported all of their kernels, to answer two questions: is it doable, given all the restrictions shown before, and is it worth it, is it actually performant?
So we ported all of those kernels, first from C++ to NumPy and then to JAX, trying to keep the interface identical, to be able to use our unit tests to make sure that our port actually functions the way it should. The first thing we found is that we had a bunch of kernels with loops on irregular intervals, intervals whose sizes are dynamic, a function of the data, which is something I told you JAX does not like. Then we realized that with just some padding and masking, we could deal with it.
We can work with that: we introduced a type for it, following the ideas I've talked about, and our life was nice. Then, a bunch of our kernels mutate output parameters, which is something JAX also says you should not be doing. So we introduced a mutable JAX array, which is just a box around a JAX array: whenever you mutate it, it replaces the content with a new array. That's a thin abstraction, but the JIT compiler does a good job with it.
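A sketch of the padding-and-masking idea for data-dependent interval sizes (the kernel here is a made-up sum, not one of the TOAST kernels):

```python
import jax
import jax.numpy as jnp
from functools import partial

# Pad every interval to a fixed maximum size and mask the padded tail,
# so the jitted function only ever sees one static shape.
@partial(jax.jit, static_argnums=2)
def masked_sum(padded, length, max_len):
    mask = jnp.arange(max_len) < length   # True for the real entries only
    return jnp.sum(jnp.where(mask, padded, 0.0))

data = jnp.array([1.0, 2.0, 3.0, 0.0, 0.0])  # interval of length 3, padded to 5
print(float(masked_sum(data, 3, 5)))          # 6.0
```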
So we are happy about that. And we worked a lot on reducing data movement; that's something we're still working on, because we have a pipeline abstraction and there are ways to improve on that. Doing all of this, we got a seven-times reduction in lines of code: for all of the code that is used inside these pipelines, our code is now seven times shorter, going from the C++-plus-Python version to the JAX version. And on the plot, those bars are all of the kernels that have been ported.
Looking at them, you could tell me: okay, none of those bars is seven times shorter; some of them maybe five times, but nothing is seven times shorter. That's because a lot of the reduction was not in the actual kernels but in the utility functions. Since we're using JAX, which gives us access to NumPy, SciPy, and a bunch of other things, we don't have to write our own binding to the MKL, we don't have to hand-code things like how to normalize a vector, all of those things.
So that was a lot of reinventing the wheel that was cut out of the code, and overall the code reduction means that the code is much easier to keep in your head, which is a nice win. Was it worth it? Here we have a bunch of timings for various kernels. The first two lines are the time it took to move data to the GPU and back from the GPU, and the finalize step also does a little bit of cleanup.
That's why it's not at zero for OpenMP. We are comparing OpenMP C++ kernels that have been optimized, running on four threads, which for this particular problem is the fastest we can go (if you add more threads, bad things happen), against JAX running on one GPU. Those first kernels are slower in JAX, and that's because they spend their time moving their data to and from the GPU, but that's something we're working on, and it's probably going to be fixed in one week.
The other kernels are nicer, and here we have a times-16 speedup. In the first version of this slide we had a times-62 speedup on one kernel, the noise-weighting kernel, and that was such a good speedup that we went to look at the code thinking: okay, there is a problem somewhere, that's not normal. And we found the problem: we had a performance bug in our C++ code. We had a critical section inside the parallel loop that was making the code slower than the sequential version.
We fixed it, and now things are much better in the C++ code. Something that's interesting with JAX here: parallelism is transparent in JAX, you don't have to think about parallelism, meaning you cannot introduce performance bugs by making mistakes in the way you parallelize your code. That's really useful for people who are primarily domain experts: they know their science, in our case cosmology, but writing high-performance code is not their main field of study.
So this is a proof of concept: we have ported a bunch of kernels in two pipelines. It's a work in progress, but we found that it's doable, which is very nice, and that it's worth doing, because the code is easier to understand, shorter, and faster. What we could do to go further is to reduce data movement; that's ongoing work, and there is still a lot of data movement that is avoidable.
Also, a lot of our code complexity comes from the fact that we're trying to preserve the interface of the C++ kernels, but we could instead move to JAX-style pure functions, and then our code would be significantly simpler. And if we did all of that, we could redesign our pipelines to JIT all of their kernels into a single huge GPU kernel on the fly, and that, we expect, would bring lots of performance benefits.
To conclude, I would say JAX is a sweet spot on the design Pareto front for people who are doing research on complex numerical code, because it lets you write code that is easier to make correct and fast, and that very quickly, so you're being very productive. It lets you focus on the semantics of the code: it separates neatly between the semantics, the thing you're dealing with as a domain expert, and the optimization, which is left to the compiler, and with all of these restrictions the compiler can be as clever as it can be.
Also, it lets you have a single code base with both CPU and GPU code, and that's really nice, because you can test your code in your GitHub CI and then run it on GPU on Perlmutter, and it's going to be the same code behaving the same way, which is very practical. Also, the immutability limitations are actually very nice for correctness. What we found, after having ported the code, is bugs in the C++ part of the code that were not present in the JAX port, because it was immutable.
We were not allowed to do some of the things that we had done in the C++, and JAX makes it really, really easy to use numerical building blocks, because you're using the things you're used to using in Python, like NumPy and SciPy. So if you need to solve a linear system inside a loop inside a loop inside a loop inside your kernel, which is a real thing I did like two weeks ago, you can do it, it's going to be very easy, and that is very practical, very important.
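As a hedged sketch of how easy that is, here is a batch of small linear systems solved inside a vectorized "loop" (the matrices are illustrative):

```python
import jax
import jax.numpy as jnp

# Four 3x3 systems A[i] @ x[i] = b[i], solved in one vectorized call.
A = jnp.stack([jnp.eye(3) * (i + 1.0) for i in range(4)])
b = jnp.ones((4, 3))
x = jax.vmap(jnp.linalg.solve)(A, b)   # x[i] = 1 / (i + 1)
print(x.shape)                         # (4, 3)
```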