From YouTube: Programming Hybrid HPC Systems: What is Old is New Again
Description
Viktor K. Decyk, UCLA
A: Viktor Decyk is a research physicist and adjunct professor at UCLA. He has been involved in using and developing particle-in-cell, or what everybody calls PIC (everybody knows what we mean when we say PIC), his entire career, beginning as a graduate student under John Dawson, one of the pioneers of the technique. He has been active in parallel computing since 1987.
B: All right, so this talk is really about programming, and about recognizing that a lot of what looks new has actually been around before.
What I'm going to talk about specifically, in terms of programming, are PIC codes; that's what I do. For those of you who aren't into plasma physics, I'll just say what this is: PIC codes integrate the trajectories of many particles interacting self-consistently through their electromagnetic fields. So they model plasmas at the most fundamental, microscopic level.
Simpler, reduced models are often very fast, but you have to be careful in their use, so PIC codes give you a test bed for verifying them. The largest calculation nowadays is about three trillion interacting particles, so they've gotten very big.
The first machine I used at NERSC, which was then the MFECC, was the A machine (a different A machine from another A machine that came later), which was the CDC 7600. It was designed by Seymour Cray before he had Cray machines, and it had an interesting feature: a memory hierarchy. There was a small core memory that was fast and a large core memory that was slow, and the user managed that himself. Well, what's interesting is that you see the same thing on a GPU.
During this time I have experienced two revolutions in supercomputing. The first was the introduction of vector computing, which came in the late 70s and early 80s, and we all had to learn how to vectorize our code. We had these nice little Crays, and I actually tried to get a picture of the Cray from NERSC off the web, but I couldn't, so I had to use a different one. Anyway, in 1984 the performance of my code was about eight microseconds per particle per time step.
I also went to Japan and used the Fujitsu, which was actually even better; the performance was about an order of magnitude beyond what we had before. So vectorization was cool, and because vectorization has made a comeback, let me talk about it in a little more detail. One place vector code shows up is in field solvers, because that is usually a fairly straightforward and regular problem, compared to the particles in PIC, which are going all over the place.
Vectorizable means that the elements of a vector can be processed in any order, which is also the definition of parallelizable. Another feature that was important on the Crays was that stride-one memory access was optimal; in other words, if adjacent elements of a vector were stored next to one another in memory, that was the best thing you could have. That was primarily to avoid bank conflicts, since the memories were organized into banks.
There was no cache there, and there were something like 64 banks. The worst case was a stride of 64, because then you'd hit the same memory bank every time, and you didn't want to do that. But anyway, here is code similar to what I actually had in those days, and this is a field solver.
It's actually solving Poisson's equation in Fourier space, and this is the inner loop over the y coordinate. If I just put this through the compiler, it worried about the case where these two arrays might actually be in the same location, so it wouldn't vectorize it. But the fix was easy: you added a little compiler directive, "ignore vector dependency", and then it said okay, I can go, and it would vectorize it. Code written like that was quite straightforward, and if you're an old-timer with lots of code like that, you're going to love GPUs, because that part is really pretty straightforward.
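As a rough illustration of that kind of loop (my reconstruction, not the slide), here is a Fortran sketch of a Fourier-space Poisson inner loop with an "ignore vector dependency" directive. The array names and normalization are made up, and the directive spelling varies by compiler (on the old Crays it was CDIR$ IVDEP).

      ! Sketch of a stride-one inner loop from a 2D Fourier-space Poisson
      ! solve, E(k) = -i*k*q(k)/k**2, at one fixed kx.  The directive tells
      ! the compiler to ignore the apparent dependency between q and fx/fy
      ! (in real codes they may be passed through overlapping arguments).
      subroutine poisson_inner(q, fx, fy, dkx, dky, nyh)
      implicit none
      integer :: nyh
      real :: dkx, dky
      complex :: q(nyh), fx(nyh), fy(nyh)
      integer :: k
      real :: dky1, at1
      complex :: zt
      !dir$ ivdep
      do k = 2, nyh
         dky1 = dky*real(k - 1)
         at1 = 1.0/(dkx*dkx + dky1*dky1)
         zt = -cmplx(0.0,1.0)*at1*q(k)
         fx(k) = dkx*zt
         fy(k) = dky1*zt
      enddo
      end subroutine poisson_inner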
I actually had an early start on the second revolution, parallel computing, because I worked at JPL and Caltech with their very early hypercubes, around 1986 or so. Geoffrey Fox was one of the people who built the first Cosmic Cube, which was a bunch of IBM PCs connected together. Some years later NERSC got their T3E.
They had a T3D earlier than that, which we also used; that was a loaner or something, I guess. Anyway, the speed of the code then went up to 13 nanoseconds per particle per time step, which is nearly a factor of 500 over what we had in the early Cray days. On the other hand, the Mac on my desktop was running at 2.4 microseconds.
So the conclusion is that the desktop machine becomes equal to a supercomputer about 10 years later, more or less. Not only that, but 3D now became practical; before this it was really very difficult to do 3D, because there wasn't enough memory. Our 3D code was running at around 74 nanoseconds, and we used this machine for the Numerical Tokamak Project back around 1994 or so, originally with the T3D, I think. MPI didn't even exist yet; we were still using PVM for the message passing and so on. Eventually things got standardized, but it took a while.
I'm going to spend a few slides on how this was done, because it's going to come back to haunt us here. The thing is that we're using distributed memory, which means the processors don't share any memory whatsoever. So what we do is partition the space among the processors; this is called domain decomposition. In this picture there are about six particles in this processor's region, and six in each of the others.
When a particle got to the edge of its region and moved into the next region, there was a piece of software that checked for that and moved the particle over to that processor. This worked really well. It's the scheme people basically still use today, and for the very largest calculations it scales to hundreds of thousands of processes, actually a million nowadays.
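As a minimal sketch of the bookkeeping behind that (illustrative, not the actual code): with a uniform one-dimensional partition, the destination processor of a particle follows directly from its position.

      ! Minimal sketch (not from the talk): with a uniform 1-D domain
      ! decomposition in y, each MPI rank owns ny/nvp grid cells, and the
      ! owner of a particle follows directly from its y position.
      integer function owner_rank(y, ny, nvp)
      implicit none
      real :: y        ! particle position in grid units, 0 <= y < ny
      integer :: ny    ! total number of grid cells in y
      integer :: nvp   ! number of MPI ranks (assumed to divide ny evenly)
      owner_rank = int(y)/(ny/nvp)
      end function owner_rank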
The other thing we did was to use what computer scientists sometimes call a bulk-synchronous model. We don't overlap communication and computation, for a couple of reasons. First, there was very little to gain, because the overlap occurs at the edges, which don't have a big impact; and second, it really messed up the modularity of the code and made it very difficult to maintain.
It also turned out that there were really only about four communication patterns that we needed for most of the code, so it was much easier than you might imagine. Computer scientists all complain about MPI being the assembly language of parallel computing, but in practice it wasn't that bad.
If it's a spectral code, we would change to a uniform partition for the FFT; otherwise we might move partitions just for dynamic load balancing. There's also the transpose that the FFT uses. Those are the primary communication patterns, and we just reused them over and over again.
Okay, just one last thing, to show you some real code: this is from the particle manager using MPI. Basically, particles which leave the domain are buffered and then sent to and received from the neighboring processors. This is the send buffer.
There's a loop that handles up and down, and moving out and back in, with the same subroutine. You send these messages, and they're asynchronous; when they arrive you can find out how many particles got there from the get-count call. That was more or less the guts of the thing: you buffer the particles, send and receive them, and then the receive buffer gets unpacked and the particles are inserted back into the particle array.
One really fun bug was leaving this call out. This is waiting for the send to complete, and you might think you could leave it out, because for other reasons you already know the send was sent and you don't need to wait. That's a bad mistake, and it leads to a really nasty bug where the code runs for a hundred thousand time steps and then crashes. The reason is that the wait actually has a side effect.
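Here is a heavily simplified sketch of that pattern (my reconstruction, with illustrative names and only one direction shown): outgoing particles are packed into a send buffer, exchanged with non-blocking MPI calls, counted with MPI_GET_COUNT, and the send is completed with MPI_WAIT, the call whose omission causes the bug just described.

      ! Simplified sketch of one direction of the particle manager (names
      ! and buffer layout are illustrative, not the code on the slide).
      subroutine pmove_right(sbufr, rbufl, nsr, nrl, idimp, nbmax, &
                             kright, kleft)
      implicit none
      include 'mpif.h'
      integer :: nsr, nrl, idimp, nbmax, kright, kleft
      real :: sbufr(idimp,nbmax), rbufl(idimp,nbmax)
      integer :: msid, ssid, ierr, ncount
      integer :: istatus(MPI_STATUS_SIZE)
      ! post the receive for particles arriving from the left neighbor
      call MPI_IRECV(rbufl, idimp*nbmax, MPI_REAL, kleft, 1, &
                     MPI_COMM_WORLD, msid, ierr)
      ! send the nsr particles that are leaving toward the right neighbor
      call MPI_ISEND(sbufr, idimp*nsr, MPI_REAL, kright, 1, &
                     MPI_COMM_WORLD, ssid, ierr)
      ! wait for the receive, then ask how many particles actually arrived
      call MPI_WAIT(msid, istatus, ierr)
      call MPI_GET_COUNT(istatus, MPI_REAL, ncount, ierr)
      nrl = ncount/idimp
      ! the "fun bug": this wait looks redundant, but leaving it out leaks
      ! the request and the run eventually crashes after many time steps
      call MPI_WAIT(ssid, istatus, ierr)
      end subroutine pmove_right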
Another type of algorithm that was used on these parallel machines was blocking, or tiling. The reason is that they typically used RISC processors, which had caches; the Crays didn't have caches. The idea, basically, is that you want to process data in small chunks that fit into the fastest memory you have: you put a chunk into fast memory, calculate with it, and don't go back to it again. It's also useful for memory management, which is one of the things that's important to think about.
As we get to higher and higher performance machines, memory becomes more and more of a bottleneck, and you have to pay more attention to it. One thing we sometimes do is transpose the data. The data here is stored in Fortran order.
This access, however, is terrible, because it has a big stride, namely ny. You might think that wouldn't matter so much on a RISC processor; on the Cray a bad stride was a disaster, and here it's not quite so bad, but the reason it still matters is that these systems bring data in by cache lines. They're not just bringing in one piece of data, they bring in a whole cache line, and if the data is adjacent you get all the rest of the data for free.
So you stage the block through the fast memory, because that memory is really fast. This way you get stride one coming in and stride one going out, and the badly strided accesses happen only in the fast memory, where they don't cost much. This is what matrix transpose libraries use, and there are lots of algorithms like this; many of them are hidden from users, because they're in some library you didn't know was there.
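As a hedged illustration of the blocking idea (my own sketch, not a library routine): transpose a matrix one small tile at a time, so that both the reads and the writes to main memory stay within a few cache lines.

      ! Sketch of a cache-blocked (tiled) transpose, b = transpose(a).
      ! The nb-by-nb tile is chosen small enough to fit in the fast memory,
      ! so the badly strided accesses always hit data that is already there.
      subroutine blocked_transpose(a, b, nx, ny, nb)
      implicit none
      integer :: nx, ny, nb
      real :: a(nx,ny), b(ny,nx)
      integer :: i, j, ii, jj
      do jj = 1, ny, nb
         do ii = 1, nx, nb
            do j = jj, min(jj+nb-1, ny)
               do i = ii, min(ii+nb-1, nx)
                  b(j,i) = a(i,j)
               enddo
            enddo
         enddo
      enddo
      end subroutine blocked_transpose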
Okay, how's my time? Okay, we're good. So now I believe we're in the middle of a third revolution, in both hardware and software, and this revolution is driven by a number of things. Moore's law for frequency is pretty much gone, but Moore's law in the sense of lots of transistors is still going strong, only now it shows up as increased parallelism. There's also a requirement for low-power computing, and part of this revolution is driven by that. At least part of my goal is that kind of computing; the other part is to empower ordinary people, who don't have access to NERSC, to do really cool stuff too.
Okay, so the two leading contenders are NVIDIA GPUs and the Intel Phi, formerly known as Knights Corner. The GPU has a bit of a head start, so I'm going to start with that one because it came first. GPUs are graphics processing units, originally invented for graphics, for gaming; they became programmable in 2007.
People have told me that they didn't do this for the scientists; they did it because the gaming people wanted to control special effects themselves, and the hardware didn't provide that fast enough. So perhaps they didn't do it for us, but nonetheless we can take advantage of it. A GPU consists of, at least at the high end, about 12 to 30 SIMD multiprocessors. SIMD stands for single instruction, multiple data, which means it applies the same instruction to a whole bunch of numbers, not just one.
They're not built the same way as the Cray vector units; the Crays were really pipelined architectures. But from a software perspective we couldn't care less, because they look the same to a software developer. So you have these multiple vector processors, each with a small, fast memory, typically 16 or 48 kilobytes, plus a slow global memory. Now, "slow" should be in parentheses, because it's about an order of magnitude faster than the memory on your host machine.
So it's still pretty fast, and it's readable by everybody. One unique thing about NVIDIA in particular is that it has very fast, lightweight threads that can be switched in one clock period, and they use this feature in a number of ways; I'll talk about that a little later. They also have very high bandwidth, around 100 to 200 gigabytes per second, but only if the access is ordered. In other words, the hardware brings in a cache line at a time, so you have to take advantage of that. Secondly, it can handle thousands of threads simultaneously, and you can use this to essentially optimize the memory access.
In effect, what happens is this: suppose it takes 200 clocks to read memory. You can have 200 threads lined up, one after another; you issue one read in one clock period, go to the next thread and issue its read, go to the next, and basically pipeline the memory. It's a way where, if you have enough threads, you can almost completely hide the memory access time. The Intel MIC is actually very similar.
It was introduced in 2012; Knights Corner was around earlier, but this is now a commercial product. It has 50 to 60 processing units, about twice what the NVIDIA I was using has, but the vector length is smaller, and they do have caches.
They also have high bandwidth, in fact higher than the NVIDIA, about 300 gigabytes per second, but they can only handle four threads per core, so they can't keep as many threads outstanding as the NVIDIA. So they're similar, but different.
So basically what I'm arguing is that these machines are all similar enough that we can program to a sort of abstract model of what they are, and not worry about the actual hardware. This is my abstract picture: a node consists of a bunch of vector processors, each with a fast memory that only it has access to, and they are connected through a slower global memory; we can then connect nodes with MPI on top of that as well. Each vector unit works in lockstep, and it has local synchronization.
There is no global synchronization on this hardware, although AMD GPUs do allow it. So within one of these nodes there are two levels of parallelism that you have to program to: one is the vector, and the other is the fact that you've got a bunch of these multiprocessors running simultaneously in parallel. I don't necessarily use the languages the vendors provide; in fact, the different vendors have different ways of programming it anyway. So this model applies to a wide variety of processors.
For an NVIDIA GPU, the vector length here is actually a virtual vector length, something they call a block size. It can be larger than the hardware's: they might run 128 threads together even though the hardware only has 32, and the block size can be even bigger than that, but 128 is typically a good number. The number of vector processors and the size of the fast memory are about what I showed. The Intel Phi is similar, except the vector length is half and the number of processors is about twice.
Even an ordinary multicore host fits this picture: maybe not 16 cores on your laptop, but on the host machine there might be 4 or 16 of them. We often don't program their vector units explicitly, but they're there, and they have the same structure. Actually, even the Cray X-MP back in the day had a vector length of 64 and two to four vector processors, though no separate fast memory.
So it's a very general model, and that's what we're going to program towards. The other piece of fairly good news is that, although the pieces are put together in a different way, all the pieces are things we already know. We have vector algorithms, which, as I said, are calculations whose elements can be done in any order, and they have a long history on the Crays.
Blocking and tiling are well known, and then there's domain decomposition, partitioning the memory, so that one SIMD multiprocessor doesn't know about another one's memory; we have a long history with MPI with that kind of code. So all the pieces are actually familiar, but they're put together in a new way.
The other thing that's really surprising, and this really blew my mind, is that GPUs require you to use many more threads than there are physical processors. For somebody who has programmed MPI, vector, or OpenMP, the question is: why would you want to do that? You'd never do that with OpenMP; you wouldn't have more OpenMP threads than you have processors.
Or more MPI ranks than nodes. Well, there are two really important reasons. One is that it hides memory latency; it allows you to pipeline memory, and that's important. The other, even more unexpected, is that you can use a master-slave model for automatic load balancing. Suppose you have one core and a hundred threads, each a little sub-piece of work, and some take more time and some take less.
In the old days with OpenMP or MPI you'd have to worry about where to put the domain boundaries to make sure the work was balanced. But here, if one guy finishes early he just gets more work, and one who finishes late doesn't, and eventually it all statistically balances out. This is really a kind of master-slave model, but we had never thought of it that way. And now NVIDIA actually has a certain amount of this in the hardware.
They're even adding the ability to hand work from one multiprocessor to another, so you get even more of this. But you can also do it manually; if you know your own code, you can manage this kind of thing yourself, and not only on the GPU, but also if you're writing in another language.
You can do that because, by having these many small parallel pieces of work, you have a lot of flexibility about how you hand them out. That was unanticipated to me, and a genuinely new feature. The other thing is that optimizing data movement is even more critical than it was before. It's always been important, and now it's absolutely critical that you deal with it.
So let's look a little bit at some of these algorithms. Here's the field solver that I showed before, now from the GPU code. You can see there are only three lines of code that changed. One is that the loop index, which used to be a do-loop index, is now essentially this thread expression, and that is analogous to your node number in MPI.
Everybody asks "what's my node number?", except that the node number is two-dimensional in CUDA; but that's basically your node ID. And then, because the number of threads actually running may not be an exact multiple of the loop count, there's an if statement to skip the extra threads.
So it's really easy. Like I said, if you have a lot of dusty code from the Cray days, you're going to love the GPU.
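A sketch of what that change looks like in CUDA Fortran (my reconstruction, with made-up names): the do-loop index becomes a thread and block index, and an if statement guards against the launch size not being an exact multiple of the loop count.

      module field_gpu
      use cudafor
      implicit none
      contains
      ! Sketch: a field-solve loop as a CUDA Fortran kernel.  The do-loop
      ! index is computed from the thread and block IDs, and the if test
      ! guards against the number of threads launched not being an exact
      ! multiple of the loop count.
      attributes(global) subroutine poisson1_kernel(q, fk, dk, nyh)
      integer, value :: nyh
      real, value :: dk
      complex, device :: q(*), fk(*)
      integer :: k
      real :: wk
      k = threadIdx%x + blockDim%x*(blockIdx%x - 1)
      if ((k > 1) .and. (k <= nyh)) then
         wk = dk*real(k - 1)
         fk(k) = -cmplx(0.0,1.0)*q(k)/wk   ! E(k) = -i*k*q(k)/k**2
      endif
      end subroutine poisson1_kernel
      end module field_gpu

The host side would launch it with something like call poisson1_kernel<<<(nyh+63)/64,64>>>(q_d, fk_d, dk, nyh), where q_d and fk_d are device arrays.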
Here's a blocking and tiling algorithm; this is used in the transpose. The transpose on CUDA is even worse than on the host, because it reads data in big blocks, even bigger than cache lines,
so the penalty for doing it the straightforward way is even bigger. Here's the original Fortran, and this is CUDA Fortran; by the way, you can do CUDA in Fortran. You declare a shared memory array, which acts as scratch temporary memory, and the loop indices now become these thread IDs. You copy the data in, and the if statement replaces the loop bounds (I should have highlighted that). Because many threads are working, you have to have a synchronization point: the threads don't all finish writing at the same time, and you want to make sure nobody starts reading before everybody has had a chance to write. Then the loop parameters are flipped, and the index that was threadIdx%x on the way in becomes threadIdx%y on the way out.
That's the equivalent of how you flip these things, and so on; it's a fairly straightforward copy into this CUDA dialect. One other interesting thing: you'll notice that the buffer really only needs nblock by nblock data, but I made the leading dimension one bigger. That's an old Cray trick to avoid memory bank conflicts, because the shared memory here has the same kind of memory bank conflicts that the Cray had. So here's another old story coming back, and the same old trick works: you make the dimension not a power of two, 65 instead of 64, and most of the bank conflicts disappear. Another new idea that's really an old idea.
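A rough CUDA Fortran sketch of that tiled transpose (again a reconstruction, not the slide): the tile is staged through shared memory, syncthreads separates the writes from the reads, and the leading dimension of the shared tile is padded by one, the 65-instead-of-64 trick, to avoid shared-memory bank conflicts.

      module transpose_gpu
      use cudafor
      implicit none
      integer, parameter :: NBLK = 16
      contains
      ! Sketch of a tiled transpose, b = transpose(a), on the GPU.  Each
      ! thread block moves one NBLK x NBLK tile through shared memory so
      ! that both the global reads and the global writes are stride one.
      attributes(global) subroutine gpu_transpose(a, b, nx, ny)
      integer, value :: nx, ny
      real, device :: a(nx,ny), b(ny,nx)
      ! leading dimension padded by one: the old trick against bank conflicts
      real, shared :: tile(NBLK+1,NBLK)
      integer :: i, j, it, jt
      i = threadIdx%x + NBLK*(blockIdx%x - 1)
      j = threadIdx%y + NBLK*(blockIdx%y - 1)
      if ((i <= nx) .and. (j <= ny)) tile(threadIdx%x,threadIdx%y) = a(i,j)
      ! make sure the whole tile is written before anyone reads from it
      call syncthreads()
      it = threadIdx%y + NBLK*(blockIdx%x - 1)
      jt = threadIdx%x + NBLK*(blockIdx%y - 1)
      if ((it <= nx) .and. (jt <= ny)) b(jt,it) = tile(threadIdx%y,threadIdx%x)
      end subroutine gpu_transpose
      end module transpose_gpu

It would be launched with a 16 by 16 thread block and a grid big enough to cover the whole matrix.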
I guess you don't have the font I used on your version of Keynote; I had a special font there. Okay, so unfortunately this is the only equation in my talk. What I've been talking about so far is fairly general; now I'm going to talk a little more specifically about the PIC code that we have and what we've done for the GPU in more detail.
Here is a cartoon of a PIC code. It really has three steps: there's a Poisson equation, you deposit a charge density, and then you advance the particles after you solve for the electric field. The field solve is the easiest part; the rest is basically interpolation.
If a particle is here, you distribute its charge to the four nearest grid points when you deposit; and when you get to the acceleration, you interpolate the electric field to the particle from those same four nearest grid points. For the push there's no problem, but for the deposit, if another thread is working on a nearby particle, the two can share a grid point in common, and they could be writing to the same grid point at the same time.
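For concreteness, here is roughly what that four-point (bilinear) deposit looks like as plain serial Fortran, an illustrative sketch that ignores boundary conditions and the collision issue just described.

      ! Sketch of a 2D charge deposit with bilinear (area) weighting: each
      ! particle of charge qm is shared among its four nearest grid points.
      subroutine deposit2(part, q, qm, np, idimp, nx, ny)
      implicit none
      integer :: np, idimp, nx, ny
      real :: qm
      real :: part(idimp,np)   ! part(1,i) = x, part(2,i) = y, in grid units
      real :: q(nx+1,ny+1)     ! charge density, including one guard point
      integer :: i, n, m
      real :: dx, dy
      do i = 1, np
         n = int(part(1,i))
         m = int(part(2,i))
         dx = part(1,i) - real(n)
         dy = part(2,i) - real(m)
         q(n+1,m+1) = q(n+1,m+1) + qm*(1.0-dx)*(1.0-dy)
         q(n+2,m+1) = q(n+2,m+1) + qm*dx*(1.0-dy)
         q(n+1,m+2) = q(n+1,m+2) + qm*(1.0-dx)*dy
         q(n+2,m+2) = q(n+2,m+2) + qm*dx*dy
      enddo
      end subroutine deposit2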
So that's a problem. For GPUs it became clear early on that this matters, because PIC codes have low computational intensity: even though the memory bandwidth is an order of magnitude higher than on the host, the processors are even faster, since it's a teraflop machine. PIC codes really are a streaming algorithm: the particle data and the field data should move in, you process them, you move them out, and you never read them again, just one pass through. The only way I could think of to do that is for the particles to be ordered, so that all the particles within one little tile are processed at the same time. If that were true, then things would really be easy.
And it did work, so we think we're in pretty good shape now. You divide space into these tiles; in this case I'm drawing two-by-three cells in each tile, and all the particles in a tile are stored together.
The way this works is that each tile is processed in parallel and independently. When you process a tile, you first copy the fields from global memory to fast memory, and that is the one and only read of the grid for that step. When you deposit, you also write back to global memory; when you push, you don't. All these tasks can be done in parallel.
There are some guard cells that aren't in the picture, but basically the push is trivial; it's very similar to the MPI code, except that the partitions are tiny. For the deposit you have this possible data collision. It turns out it's easy if each tile is controlled by just one thread, so it depends on how you assign threads to tiles: if you give one thread to each tile, there are no data hazards and everything works. But if you want better performance, you can use a vector of threads on one tile, and then data collisions are possible. That only works if the system supports atomic updates. An atomic update means that an update, say adding something to a memory location, is not interruptible; it completes before anyone else tries to add to it.
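A minimal CUDA Fortran sketch of that idea (illustrative only, one tile shown, guard cells between tiles omitted): a block of threads deposits into a shared-memory charge array, and atomicadd makes the concurrent updates safe.

      module deposit_gpu
      use cudafor
      implicit none
      contains
      ! Sketch: a 1D charge deposit where one thread block works on one
      ! tile.  Two particles can hit the same grid point, so the updates
      ! to the shared-memory charge array go through atomicadd.
      attributes(global) subroutine deposit_tile(part, q, qm, npp, nxt)
      integer, value :: npp, nxt
      real, value :: qm
      real, device :: part(*), q(*)
      real, shared :: sq(129)          ! tile charge, assumes nxt <= 128
      integer :: i, n
      real :: dx, old
      ! zero the shared tile cooperatively
      i = threadIdx%x
      do while (i <= nxt+1)
         sq(i) = 0.0
         i = i + blockDim%x
      enddo
      call syncthreads()
      ! each thread handles a strided subset of the tile's particles
      i = threadIdx%x
      do while (i <= npp)
         n = int(part(i))
         dx = part(i) - real(n)
         old = atomicadd(sq(n+1), qm*(1.0-dx))
         old = atomicadd(sq(n+2), qm*dx)
         i = i + blockDim%x
      enddo
      call syncthreads()
      ! add the tile back into global memory
      i = threadIdx%x
      do while (i <= nxt+1)
         q(i) = q(i) + sq(i)
         i = i + blockDim%x
      enddo
      end subroutine deposit_tile
      end module deposit_gpu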
For the reordering, the way we did it was this: we first create a list of the particles that are leaving a tile; then, using that list, each thread places its outgoing particles into an ordered particle buffer; then there's a synchronization, and then you read from that buffer. Let me draw a picture. Here's a tile, and there are particles going out in eight different directions: direction one, two, three, and so on.
All the particles going in direction one are stored together in the buffer, the ones going in direction two are stored next to them, and so on, so it's an ordered buffer. With the MPI code we had a left buffer and a right buffer; here we do all the directions simultaneously. There isn't a loop over directions, they're all done at once. Then, when you go to read them, every tile knows it needs data from eight different directions, and it knows exactly where to go to get it, so you don't have to search or anything like that. This all runs in parallel: all the tiles run in parallel, and all the directions are done in parallel.
So that was the hard part.
Here are some benchmarks from this code. We've also implemented an OpenMP version: if you throw away the vectorization and just write a serial version of the tile update, then the loop over tiles becomes the OpenMP loop, and you can run it on a traditional architecture.
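A sketch of that OpenMP structure (with illustrative names, and a trivial free-streaming push standing in for the real one): the outer loop over tiles is the parallel loop, and each tile's particle list is processed serially inside it.

      ! Sketch: in the OpenMP version the outer loop over tiles becomes the
      ! parallel loop, and each tile's particle list is processed serially.
      ! The push body here is just a free-streaming placeholder, x = x + v*dt.
      subroutine push_tiles(ppart, kpic, ntiles, nppmx, idimp, dt)
      implicit none
      integer :: ntiles, nppmx, idimp
      real :: dt
      real :: ppart(idimp,nppmx,ntiles)  ! particles stored tile by tile
      integer :: kpic(ntiles)            ! number of particles in each tile
      integer :: k, j
!$omp parallel do private(j)
      do k = 1, ntiles
         do j = 1, kpic(k)
            ppart(1,j,k) = ppart(1,j,k) + dt*ppart(2,j,k)
         enddo
      enddo
!$omp end parallel do
      end subroutine push_tiles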
Here's the original code. This is 2D, the same problem I did in 1975 more or less, though with a bigger problem size. It takes about 31 nanoseconds per particle per time step on one core of an i7, and 800 picoseconds with OpenMP on the 12 cores we have on our Dawson system.
Compared with the T3E we had 16 years earlier, the cost per particle push is now about 7 times 10 to the minus 8 cents instead of 30 cents, so it has gone down nine orders of magnitude. That's good, and the GPU card fits in your desktop.
This problem, 150 million particles running on one GPU, is similar in size to the Numerical Tokamak Project runs we did on that T3E, where 100 million particles was the biggest we did.
As for speedup, one of these was about thirty times faster than a single core, and this one is about fifty. I'm going to skip over the details of the multiple-GPU work, because I'm getting low on time.
A graduate student in our group has taken OSIRIS, which is our workhorse PIC code for laser-plasma interactions and plasma accelerators, and implemented this algorithm in it. The code I've been describing is actually a skeleton code, a bare-bones code that I use for testing, evaluating, and developing for new architectures; he took those algorithms, put them into OSIRIS, and he's now running on up to 225 GPUs. This is weak scaling, and it's very, very good.
I mean, it's almost perfect weak scaling across these GPUs. The problem he's running has 22 billion particles, on our Dawson cluster, which has about 300 GPUs, and a time step takes about a tenth of a second.
Strong scaling, on the other hand, is very problematic, because with GPUs you have something like 500,000 cores and you've got to keep them all busy or it's worthless. When you do strong scaling you're reducing the problem size per GPU, and it doesn't take long before the problem is just too small for the GPU to do anything with, because you can't keep all the threads busy. So it's hard to do strong scaling; you're really pushed into doing weak scaling, and your problem size determines how many GPUs you're going to use.
The goal is to provide and document parallel PIC codes and make them open source. I don't have time to really go through the activities this center is funded to provide, but the key thing is that the components and codes, both skeleton codes and otherwise, will be available under open-source licenses. It's intended to be a community resource for PIC, for anybody, and we welcome contributions.
This activity is just getting going; we don't actually have a website for it yet. But I do have a website with some of these skeleton codes, and those have been available for a few months now. They are intended to be teaching codes: to teach people what a PIC code is, and also to teach people the parallel techniques. There will be versions with three levels of parallelism coming, and we're interested in other languages and in collaborating with people. For example, how are you going to express three levels of parallelism? Is there a language out there that can do that? We'd be interested in collaborating with people to look at those issues. All the codes I talked about today, except for OSIRIS, are available on that website; you can download them. The CUDA C code is there, and the CUDA Fortran will be there probably in a week or two.
That includes vector techniques, blocking and tiling, and message passing. One thing I didn't mention much is data-parallel techniques, which is basically vectorization with array syntax. There wasn't much history at NERSC with this, but I used it on the Thinking Machines Connection Machine, where it was well developed, and HPF Fortran also had that kind of thing. Those are actually fairly useful, and I use some of those techniques as well, though I didn't talk about them much. So, programming to this hardware abstraction leads to common algorithms.
By the way, the OpenMP version also keeps the particles ordered. One of the things we sometimes want in PIC codes is to add additional operators, like collision operators, and that has traditionally been expensive, because you have to find where your neighbors are. Now it's free: the neighbors are right there, and you know where they are, because everything is in these micro-tiles.
Yes, at the very beginning of the code. The code is initialized on the host, and then we just have a routine that checks each particle, one by one, to see which tile it belongs in, and moves it into that tile. But that's only done once, at t equals zero. Is that what you're asking about?
Oh, for this one, well, that's a physics question. These skeleton codes are just uniform, but the OSIRIS code has different ways of handling non-uniformity. The way I do it personally is with a root-finding scheme: I have a distribution function, and I look for the points where the integrated density reaches the next integer, because with the way I normalize the density, the total integral is the total number of particles.
So whenever the integral reaches the next whole number, that's where the next particle goes. OSIRIS has a different way: they weight the charge roughly inversely to the density, so that's a different way of doing it too, and there you just do an integral. But that's all part of the initialization; none of that is done on the GPU, it's all done on the host.
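A sketch of that kind of loading (my own illustration, not the skeleton-code or OSIRIS routine): integrate a 1D density profile and place a particle wherever the running integral crosses the next whole number.

      ! Sketch: place np particles according to a 1D density profile dens(x)
      ! normalized so its integral over [0,nx] is np.  A particle is placed
      ! wherever the running integral crosses the next whole number.
      subroutine load_profile(x, np, nx, nbins)
      implicit none
      integer :: np, nx, nbins
      real :: x(np)
      integer :: i, j
      real :: dx, s, snew, tnext
      dx = real(nx)/real(nbins)
      s = 0.0
      j = 1
      tnext = real(j)
      do i = 1, nbins
         snew = s + dens(dx*(real(i)-0.5))*dx
         do while ((tnext <= snew) .and. (j <= np))
            ! linear interpolation inside the bin for the crossing point
            x(j) = dx*(real(i-1) + (tnext - s)/(snew - s))
            j = j + 1
            tnext = real(j)
         enddo
         s = snew
      enddo
      contains
         ! illustrative profile: uniform, np/nx particles per cell
         real function dens(xx)
         real :: xx
         dens = real(np)/real(nx)
         end function dens
      end subroutine load_profile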
A: So you put a lot of work into your skeleton code, as you described it, and then you took the results of that, or somebody took them, and integrated them into OSIRIS. Would you characterize the amount of work needed to bring OSIRIS onto these architectures as easy, difficult, major, minor?
B: Well, I think medium. Adam also added some additional optimizations that I didn't have, because he's really gung-ho about this stuff; I mean, he read through the generated assembly, he really got into it. But in general the hard part was the sorter and the reordering, and you can use that as a black box.
I don't think you need to know how it works. The push and the deposit are pretty straightforward, because they're largely the same old code, except that there's another layer: an outer loop over tiles and an inner loop over the number of particles in the tile, plus a temporary copy to a local array. Otherwise it's the same algorithm. So I would say, at least on one GPU,
it should be fairly straightforward, and on my website I have a little diff-style listing that shows what I changed to go from one version to another, so you can look at that. There are little details that are kind of annoying: on the GPU, every subroutine becomes two subroutines, because there's a host subroutine that launches the kernel, and what used to be inside your loop now lives in a different place.
So there are these annoying things you have to worry about, and it will take a while, but I don't think either the push or the deposit has anything terribly difficult. There are some new things, like atomic updates, but that's just a function call, and so on. The hard part was the reordering, and I hope you never have to look at that.
D: My question is: you've done all that work with CUDA. What do you think about the directive approaches like OpenACC? Do you think they're going to be mature enough that you could use them?
B: Yeah, they're not there yet, certainly not now. Frank Tsung is actually here; yes, he is.
He's actually been playing with OpenACC, and it just looks like it doesn't support everything that you need to control. It's kind of immature; would you say that's fair?
E: So, Victor, on PIC codes: once you've done that and you run it through the compiler, the compiler is just going to tell you whether it can do it, but at that point you're stuck, because it doesn't provide hints about how to get the compiler to understand what you're trying to accomplish and optimize it for the GPU.
B: We also tried this on the Intel Phi to see how that worked, and the parallelization part worked great. It didn't vectorize anything out of the box, though, so clearly that's the thing to work on for the Phi.
But if the structure of the code is correct, it will work on the Phi; the parallel part does work. The vectorization does nothing at the moment, but we know it can vectorize, because CUDA has vectorized it, so we know it's possible.
I don't really care who the winner is in terms of the actual vendor; I think we can make it work, because I think they're all going to look like this. You can see that the SIMD units have to be there because of energy; it's just inevitable that the hardware is going to look something like that, I just don't know exactly how. And the languages are catching up, but they're not there yet. With MPI we eventually got a standard; here it's still a moving target. I think the abstractions for how you express two levels of parallelism are still not totally there. It's new, and people haven't been doing it for a long time.
C: So you kind of have a limit on the amount of parallelism you can exploit with this sorting algorithm, which is just the number of grid points. Does the physics justify simply increasing the number of grid points if you end up with more hardware cores?
B: Right, exactly. The tile size is clearly arbitrary, and when I did those benchmarks I just scanned through and found the size that was optimal. There are actually two competing effects on GPUs. One is that the bigger the tiles, the more shared memory you need, and the GPUs are trying to run multiple independent blocks simultaneously; they call this occupancy.
So the more shared memory you request, the lower the occupancy you get. On the other hand, the bigger the tile, the faster the reordering goes, because very few particles leave a tile. So there's typically a peak somewhere in between, where the occupancy starts to go down but the sorting is still cheap.
It's typically 16 by 16 that we're using most of the time, though that probably wiggles a bit. If you're using one thread per tile, like we did on the pre-Fermi architecture, then the tiles are more like two by three, because with only one thread responsible for a tile you quickly run out of shared memory.