From YouTube: Programming Models for GPU

Description: Part of the Using Perlmutter Training, Jan 5-7, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/using-perlmutter-training-jan2022/
So you have a small number of very fast threads that can run at high frequency. You have large caches and a lot of silicon dedicated to supporting things like branching and switching back and forth between different types of code.
On the other hand, a GPU is a throughput-oriented device, so it's really suited for parallel work, particularly when you're doing the same operation on many elements: grid points, particles, anything where you're applying the same thing to many, many different elements of data is usually a good fit. It has many more threads, but individually each of those threads is less powerful than a CPU thread. The memory capacity is smaller, but much, much faster, as Jack pointed out.
This is a heterogeneous model: you're still executing serial CPU code, but interspersed with that you're launching, or offloading, work to a device. What this means is that, in general, you should keep latency-sensitive and serial work on the CPU, and, as Jack mentioned, the cost of moving data between the device and the host can be quite high, so it's always best to keep the data wherever it's used, either device or host.
So what does the landscape of these different models look like? I somewhat arbitrarily chose two axes to slice this across. On the horizontal axis we have ease of use, or level of control, which is also a bit of a proxy for the number of features, or what your possibilities are. All the way on the right we have CUDA, which is the native programming model for NVIDIA GPUs, but it's also the lowest on the portability axis, because there's really one main compiler (LLVM can do it too, so one or two compilers that can support CUDA), and it really only runs on NVIDIA hardware. Still offering quite a lot of control and features, but a little less verbose, I'd say, than writing raw CUDA, are the C++ frameworks.
These are things like Kokkos or SYCL, and we'll touch on those as well. Next, a bit easier to use than those in several situations, in my opinion, are the directive-based models. If you've been writing OpenMP code on CPUs, these are similar: you have a set of directives that allow you to offload work as well, and OpenACC is a similar story to OpenMP.
It's a directive-based model. And then finally, I've put some Fortran and C++ logos up here, because there's increasing support for offload and parallelism directly in those standards. In C++ there's the parallel standard template library, many of whose features showed up in C++17, and this can be a really powerful approach that allows you to express parallelism that can run on many different platforms; I think even Microsoft's compilers support it. It's the same story for Fortran: there's do concurrent.
So let's start with the native model, CUDA. I think this is a good place to start, because it really serves as a reference point for all the other models, and knowing what's going on in CUDA, at least at a high level, can help you understand what's happening behind the scenes in the higher-level models.
In terms of benefits and pros and cons, the obvious pro is that CUDA is co-designed with NVIDIA's hardware, so you typically get full control and direct access to essentially every feature of an NVIDIA GPU by using CUDA.
So, starting with CUDA from the very basics: CUDA C, or C++, is an extension of the base language, and the real key thing it provides is this extension called a kernel. A kernel is like a regular function, except that it's executed some number of times in parallel by different CUDA threads, and you indicate that you have a kernel with the __global__ specifier.
One kernel consists of a grid of blocks, and this grid of blocks can be one-, two-, or three-dimensional. Then, within each block, you have a set of threads that can also be indexed in one, two, or three dimensions. When you launch a kernel, you say how many blocks you want and how many threads per block, and those parameters can either be integers or dim3 types for the higher dimensions, and that lets you really map the thread hierarchy onto your problem.
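The kernel code on the slides isn't captured in the transcript, but here is a minimal sketch of what a 1D kernel definition and launch look like; names such as vec_add and the sizes are illustrative, not from the talk:

```cuda
// Minimal CUDA sketch: a __global__ kernel launched over a 1D grid of blocks.
#include <cstdio>
#include <cuda_runtime.h>

// The __global__ specifier marks this as a kernel: each CUDA thread runs one
// iteration, identified by its block and thread indices.
__global__ void vec_add(const float* x, const float* y, float* z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = x[i] + y[i];   // guard: the grid may overshoot n
}

int main() {
    const int n = 1 << 20;
    float *x, *y, *z;
    // Managed (unified) memory keeps the sketch short; cudaMalloc plus
    // cudaMemcpy is the fully explicit alternative.
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&z, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threads = 256;                         // threads per block
    int blocks  = (n + threads - 1) / threads; // blocks in the grid
    vec_add<<<blocks, threads>>>(x, y, z, n);  // <<<grid, block>>> launch syntax
    cudaDeviceSynchronize();                   // kernel launches are asynchronous

    printf("z[0] = %f\n", z[0]);               // expect 3.0
    cudaFree(x); cudaFree(y); cudaFree(z);
    return 0;
}
```

The <<<blocks, threads>>> launch parameters are where the grid and block dimensions described above are supplied; dim3 values would be used instead of plain integers for 2D or 3D launches.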
That thread hierarchy comes with a corresponding memory hierarchy that Jack touched upon. The global memory (you might hear HBM) is the memory on the GPU, and it's accessible to different kernels, so multiple kernels, or different grids, can each access that global memory.
This talk isn't a deep dive into the details of CUDA programming; there's a huge amount of content available online about this.
I do recommend that anybody who's going to do NVIDIA GPU programming should read at least the first few sections of the CUDA programming guide. That will really help baseline your knowledge of what's going on with the device, and I can definitely recommend it; it's possibly more accessible than you think. Then there's also the NVIDIA blog and the GTC talks and slides series that I definitely recommend. And then I'll end with a tip, or maybe a warning.
There's a huge amount of content available if you go online and look for CUDA. I'll just caution you to check the dates of whatever content you're looking at, because CUDA has definitely changed, in some cases significantly, over the years, with new features, relaxed restrictions, et cetera. So definitely check the date and make sure you're reading a modern source that doesn't give you outdated guidance.
So, moving along through the landscape here, on to C++ frameworks. These are usually built as cross-platform abstraction layers: they give you a set of modern C++ abstractions and primitives that you can compose to express your application, and they tend to target accelerators and CPUs from multiple vendors.
You have to write C++; that's not really a downside, but it could be if your application is in Fortran, say. They require some amount of buy-in for a good interface setup, which can be tricky, and they can often come with a learning curve. In some cases these are really new or up-and-coming projects that may or may not have direct vendor support, so if ecosystem maturity, and the ability to contractually pay somebody to work on it for you, is important, that could be a concern.
Kokkos is a project largely run out of Sandia, but it has broad support within the Department of Energy, and it's built as an ecosystem that includes a programming model and abstractions, and then a number of libraries and tools that come with it. SYCL is a cross-platform abstraction layer for which, at this point in time, a lot of the support is coming from Intel.
SYCL is the native programming model for the upcoming Aurora system, but it's not proprietary: SYCL is a standard run by the Khronos Group, which is not Intel, so it's an independent, open standard. It'll be familiar to anyone who's done work with OpenCL; it's kind of like the C++ version of that.
The NNSA labs in particular have really embraced Kokkos, and it's all open and available. You can go to github.com/kokkos, where they have a really great set of tutorials and examples available, and an extremely helpful Slack channel. I've also included here the reference to their most recent paper, which really outlines all the different capabilities that are available, and I can also recommend googling for this GTC talk.
So, just to dive in and give a little bit of a flavor of what Kokkos is about: some of the main abstractions are views, memory spaces, and execution spaces. A view is like a shared pointer to multidimensional data that lives in a particular memory space, and it comes with a layout; what a layout means is basically which index is the fast one.
So here's the vector addition example, which you'll see everywhere, implemented in Kokkos. One thing I want to point out is that this entire code is, you know, "Kokkos-ified", if you will.
And finally, I even use the Kokkos reduction pattern to compute a final sum. At this point I want to mention that in this toy 1D example I don't really take advantage of the layout abstraction, but with more dimensions, that hiding of the organization of data in memory allows for good cache utilization on CPUs and for coalesced memory transactions on GPUs.
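The slide's code isn't reproduced in the transcript; a minimal sketch in the same spirit (a 1D, view-based vector add followed by a parallel_reduce sum, with illustrative names) might look like this:

```cpp
// Minimal Kokkos sketch: Views for the data, parallel_for for the add,
// parallel_reduce for the final sum. Not the slide's exact code.
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int n = 1 << 20;
        // A View is a shared-pointer-like handle to (possibly multidimensional)
        // data in a memory space, with a layout appropriate to the device.
        Kokkos::View<double*> x("x", n), y("y", n), z("z", n);

        Kokkos::parallel_for("init", n, KOKKOS_LAMBDA(int i) {
            x(i) = 1.0; y(i) = 2.0;
        });
        Kokkos::parallel_for("add", n, KOKKOS_LAMBDA(int i) {
            z(i) = x(i) + y(i);
        });

        double sum = 0.0;
        // The reduction pattern mentioned above: each thread accumulates
        // into its own lsum, and Kokkos combines the partial results.
        Kokkos::parallel_reduce("sum", n, KOKKOS_LAMBDA(int i, double& lsum) {
            lsum += z(i);
        }, sum);
        printf("sum = %f\n", sum);  // expect 3 * n
    }
    Kokkos::finalize();
    return 0;
}
```

The same source compiles for CUDA, HIP, OpenMP, or serial back ends; the execution and memory spaces are chosen at build time.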
This is actually joint work between NERSC, ALCF, and a company called Codeplay, where we're actively targeting support for the A100. That project is underway now and progressing really nicely, but it does mean that the support is brand new.
The feature set is really good. Again, DPC++, which is an implementation of SYCL, is the native model for Aurora, and that particular compiler, which is based on LLVM, is directly supported by Intel.
If you want to use SYCL at NERSC, we can also obtain some support for that through our contracts with Codeplay. Again, it's a slightly different flavor, but it's also based on modern C++, and it could be a good option for anyone who's really familiar with OpenCL: many of the concepts, like a queue, will feel really natural.
You have DPC++, which is based on open-source LLVM; like I mentioned, there are also proprietary compilers from Codeplay.
There's support for things like vector engines through that. This particular corner of the slide is where NERSC and ALCF are working together: right now this is a public fork of LLVM, but the eventual aim is inclusion in the main project, and we're working on targeting a back end that generates PTX code directly, so you can still get very high performance.
And since this is open source, anybody with a recent NVIDIA GPU will also get the benefits of this, although we are targeting the A100, and we're also developing a lot of extensions to enable access to some of the key A100 features, like the tensor cores and some of the asynchronous operations that may be familiar to those who have done a lot of CUDA programming recently.
Okay, so without going too much further: what does this code look like? If you squint a little bit, it's very similar to the Kokkos code shown earlier.
Here you just declare, sort of, shared pointers that can be used anywhere, and all of that buffer and accessor code is gone, which shortens the code significantly, though it could result in slightly less performance. So if you tried SYCL before and thought it was too verbose, the latest version definitely improves that situation.
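Again, the slide's code isn't in the transcript; a minimal SYCL 2020 sketch in the shared-pointer (unified shared memory) style described here, with illustrative names, could look like this:

```cpp
// Minimal SYCL 2020 sketch using unified shared memory (malloc_shared),
// i.e. the "shared pointer" style with no buffers or accessors.
#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    sycl::queue q;  // a queue targets a device, much as in OpenCL

    // malloc_shared pointers are usable on both host and device.
    float* x = sycl::malloc_shared<float>(n, q);
    float* y = sycl::malloc_shared<float>(n, q);
    float* z = sycl::malloc_shared<float>(n, q);
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // Launch n work-items; each adds one element.
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        z[i] = x[i] + y[i];
    }).wait();  // kernels run asynchronously, so wait before reading z

    printf("z[0] = %f\n", z[0]);  // expect 3.0
    sycl::free(x, q); sycl::free(y, q); sycl::free(z, q);
    return 0;
}
```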
As for SYCL at NERSC: it can compile and run today. We don't have a full module file available at this exact moment, but I'm happy to make one available to you, and we definitely want to hear if you're interested in SYCL; I definitely want to hear from you.
I want to touch briefly on the parallelism built into the languages themselves. In Fortran, if you write a do concurrent loop like this, or you use array intrinsics like matrix multiply, reshape, transpose, et cetera (and I think they're adding more all the time), then you can just compile with the -stdpar option with NVIDIA Fortran, and that will give you parallel code offloaded to the GPU, with no changes at all to your code; that's just 100% ISO Fortran. I think NVIDIA's might be the only compiler that does GPU offload, but there are a number of other compilers, like Intel's, that will generate parallel CPU code in some cases with these do concurrent constructs.
Switching back to C++: I think this is a little bit more full-featured than the support in Fortran at the moment. You have a bunch of really powerful parallel algorithms, like transform, transform_reduce, scans, for_each, et cetera; check out the numeric, algorithm, and execution headers on your favorite C++ reference site.
So here's an example of that, with, again, that same simple "let's add two vectors together". I make some vectors, fill them up with some data on the host, and then, on the device, I say I want to do the transform algorithm with the parallel unsequenced execution policy on the iterators of those vectors, and adding them together is the operation I want to do, expressed in this lambda here. And that's it: this will get turned into parallel code by nvc++ and execute on the GPU. You can compile the same code with Intel's compiler and get parallel CPU code, and so on.
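A minimal sketch of that pattern (illustrative names; compiled with, for example, nvc++ -stdpar for GPU offload):

```cpp
// Minimal C++17 parallel-algorithms sketch: std::transform with the
// par_unseq execution policy.
#include <algorithm>
#include <execution>
#include <vector>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f), z(n);

    // The lambda is the element-wise operation; the policy asks the
    // implementation to run it in parallel (and vectorized) if it can.
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(), y.begin(), z.begin(),
                   [](float a, float b) { return a + b; });

    printf("z[0] = %f\n", z[0]);  // expect 3.0
    return 0;
}
```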
I won't be doing a deep dive on any of these, but here are just a few things to keep an eye on. atomic_ref is a great and powerful thing that allows some accesses to data to be atomic without requiring all accesses to be atomic.
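As a small illustration of that idea (a toy histogram; the names are mine, not the speaker's):

```cpp
// Minimal C++20 sketch of std::atomic_ref: the histogram bins are plain
// ints, but each increment is made atomic only at the point of update.
#include <atomic>
#include <vector>
#include <algorithm>
#include <execution>
#include <cstdio>

int main() {
    std::vector<int> data(1 << 20);
    for (std::size_t i = 0; i < data.size(); ++i) data[i] = i % 16;

    std::vector<int> hist(16, 0);  // ordinary, non-atomic storage
    std::for_each(std::execution::par, data.begin(), data.end(),
                  [&](int v) {
                      // Wrap just this access in atomic semantics.
                      std::atomic_ref<int>(hist[v]).fetch_add(1);
                  });

    // Elsewhere the same bins can be read or written non-atomically.
    printf("hist[0] = %d\n", hist[0]);
    return 0;
}
```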
Often you'll have some array that you don't want to access atomically in all situations, because that has a huge performance impact. There's also a new standard split-phase barrier, which is really useful for coordinating asynchronous work.
Speaking of iterators: there's often a need to iterate over multiple collections together, or to do something based on an index, and for that you need either a zip or a counting iterator. Instead of writing them yourself, you can usually get them out of a library like Thrust or Boost.
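For instance, a counting iterator can drive an index-based parallel loop; a small illustrative sketch using Thrust's counting_iterator (this pattern assumes an implementation, such as nvc++ stdpar or a host parallel STL, that accepts it):

```cpp
// Minimal sketch of an indexed parallel loop via thrust::counting_iterator,
// avoiding a hand-written iterator. Boost offers a similar iterator.
#include <thrust/iterator/counting_iterator.h>
#include <algorithm>
#include <execution>
#include <vector>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f), z(n);
    float* xp = x.data(); float* yp = y.data(); float* zp = z.data();

    // The "collection" we iterate over is just the index range [0, n).
    std::for_each(std::execution::par_unseq,
                  thrust::counting_iterator<int>(0),
                  thrust::counting_iterator<int>(n),
                  [=](int i) { zp[i] = xp[i] + yp[i]; });

    printf("z[0] = %f\n", z[0]);  // expect 3.0
    return 0;
}
```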
Another thing to watch is mdspan, a multidimensional span object which is supposed to make working with multidimensional data, which happens quite often in science, better.
I also want to say this isn't an exhaustive list; there are tons of new proposals and features, but these are just the ones that happen to be on my radar. So it's a great approach, but there are definitely some caveats and limitations.
You're still waiting on really brand-new modern features to be implemented; you have to use a very modern version of C++, which could be an issue for some legacy codes; and then there are other cases, if you have hierarchical parallelism, or you're chasing that last bit of performance, where things can definitely be difficult with these standard language approaches.
Okay, and at this point I'd like to turn the presentation over to Chris, who's going to talk to us about OpenMP and directive-based models.
...thing at the top, like the window decoration for Mac, the close button, things like that. It's just a minor thing, don't worry.

It's probably your individual setup; it looks good to us.

Okay, sorry, I'm not sure how to fix that. I hope you can tolerate it.
I'll just continue. Right, so, yeah: I work for the Advanced Technology Group at NERSC, and this is about OpenMP for GPUs.
The first thing is to look at the OpenMP thread hierarchy for GPUs. OpenMP introduces some new directives in order to be able to use OpenMP on the GPU, and the first one I want you to become familiar with is the target directive. This is the directive that enables you to create a GPU kernel; it's what enables you to execute code on a device. An important consideration for GPUs is that we need to make use of the massive parallelism available, so we have to create two levels of parallelism.
What OpenMP has introduced is a form of coarse-grained parallelism that's suitable for GPUs, referred to as teams parallelism; then, later on, you use the familiar parallel directive to create the fine-grained parallelism. Using both together enables you to exploit the massive parallelism on the GPU.
An important thing you need to do with GPU programming is, obviously, moving data between the CPU and the GPU, which have distinct memory spaces: on the CPU you have your DRAM, and on the GPU you have your high-bandwidth memory. OpenMP manages the data in the GPU memory, which it refers to as the device data environment, using a combination of both implicit and explicit data management.
One thing to be aware of is that we have this single variable name x, but it's actually pointing to two separate variables: on the host we have the original variable, and on the device we have a corresponding variable.
Here we use the target teams distribute parallel for directive. This work-shares the work in the subsequent loop over all of the teams and all of the threads, and then, as before, we have this map clause to handle our data management, moving the data to the GPU and back from the GPU.
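A minimal sketch of that combined directive with a map clause (illustrative, not the slide's exact code):

```cpp
// Minimal OpenMP target-offload sketch. target creates a GPU kernel;
// teams distribute gives coarse-grained work sharing over teams;
// parallel for gives fine-grained sharing over the threads in each team.
#include <cstdio>

int main() {
    const int n = 1 << 20;
    double* x = new double[n];
    for (int i = 0; i < n; ++i) x[i] = 1.0;

    // map(tofrom: ...) copies x to the device before the kernel and back after.
    #pragma omp target teams distribute parallel for map(tofrom: x[0:n])
    for (int i = 0; i < n; ++i)
        x[i] = 2.0 * x[i];

    printf("x[0] = %f\n", x[0]);  // expect 2.0
    delete[] x;
    return 0;
}
```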
We have this variable x, which is our mapped variable. This corresponds to Brandon's explanation of CUDA data stored in global memory: it's accessible by all of the threads in all of the teams.
One thing that Jack mentioned at the start of today is how important it is to minimize data movement between the CPU and the GPU in order to obtain high performance. What we can do with the OpenMP programming model is use a family of target data directives to keep the data on the GPU across multiple GPU kernels.
The data will now remain present on the GPU until we reach a corresponding exit data map, at which point we would actually free the memory on the GPU. Where this is now really useful is that we can have multiple GPU kernels that access this variable x on the device without any additional data movement. There's now no need for you to specify any map clauses, because what the OpenMP runtime will see is the usage of the original variable x; it will see that it has already been mapped, and then, when you're executing your GPU kernel, it will ensure that all references to x are to the corresponding device variable x. So it's a really powerful mechanism that OpenMP provides to minimize data movement.
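A minimal sketch of keeping data resident across kernels with enter/exit data (illustrative names):

```cpp
// Minimal sketch of the target data directives: x stays resident on the
// GPU across two kernels, with no per-kernel map clauses needed.
#include <cstdio>

int main() {
    const int n = 1 << 20;
    double* x = new double[n];
    for (int i = 0; i < n; ++i) x[i] = 1.0;

    // Allocate and copy x to the device; it stays there until exit data.
    #pragma omp target enter data map(to: x[0:n])

    #pragma omp target teams distribute parallel for  // uses device x, no map
    for (int i = 0; i < n; ++i) x[i] += 1.0;

    #pragma omp target teams distribute parallel for  // second kernel, still no map
    for (int i = 0; i < n; ++i) x[i] *= 2.0;

    // Copy the result back and free the device copy.
    #pragma omp target exit data map(from: x[0:n])
    printf("x[0] = %f\n", x[0]);  // expect 4.0
    delete[] x;
    return 0;
}
```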
The OpenMP programming model was designed to make sure that there's no unnecessary data movement, so what it does is reference-count the mapped data in order to avoid expensive data transfers. You can see where this can cause problems in user codes.
We may then decide in the user code to update this variable on the CPU, and naively you'd expect that just specifying a map clause on your target region would propagate that value onto the GPU. But that's not actually the case, because all that has done is increment the reference count. The next two slides show two ways for us to fix this code.
Similarly, another method you can use is the always modifier on the map clause. What this does is force a data transfer irrespective of the reference count. So now the only way we've modified this map clause is to add this always keyword, and once again, in the GPU kernel, we would see the updated value.
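A minimal sketch of the pitfall and the always fix (the target update directive, shown at the end here, is the other standard way to force a transfer):

```cpp
// Minimal sketch of the reference-count pitfall: x is already mapped, so a
// plain map(to: ...) would NOT re-copy the updated host value.
#include <cstdio>

int main() {
    const int n = 1024;
    double x[1024];
    for (int i = 0; i < n; ++i) x[i] = 1.0;

    #pragma omp target enter data map(to: x[0:n])  // x mapped, ref count 1

    for (int i = 0; i < n; ++i) x[i] = 5.0;        // updated on the host only

    // A plain map(to: x[0:n]) here would just bump the count 1 -> 2 -> 1;
    // the 'always' modifier forces the host-to-device transfer regardless.
    #pragma omp target teams distribute parallel for map(always, to: x[0:n])
    for (int i = 0; i < n; ++i)
        x[i] += 1.0;                               // device sees 5.0, not 1.0

    // Bring the result back to the host (the other fix: target update).
    #pragma omp target update from(x[0:n])
    printf("x[0] = %f\n", x[0]);                   // expect 6.0

    #pragma omp target exit data map(delete: x[0:n])
    return 0;
}
```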
That's something to be aware of. This was a really brief intro that just shows some of the compute and data-management considerations for OpenMP. For using OpenMP on Perlmutter, we recommend the NVIDIA compiler; this is for all C, C++, and Fortran applications. As mentioned earlier, we will soon have the Clang compiler, but this would only support C and C++ apps. For reference, in terms of what compiler options you would need to use:
C
There
was
a
presentation
on
day
one
building
and
running
gpu
applications
on
palomata,
and
also
we
have
this
web
page
at
nurse
which
not
only
goes
through
the
compiler
options.
It
also
includes
some
best
practices
for
how
you
can
get
high
performance
with
openmp.
Now, the loop directive: this has similar behavior to the distribute and for directives that we showed earlier, but it has one additional characteristic. Not only is it work-sharing; it's also making an assertion that the loop iterations are independent. This really enables the compiler to apply additional optimizations and deliver improved performance.
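A minimal sketch using the combined target teams loop form (illustrative):

```cpp
// Minimal sketch of the loop directive: it work-shares the loop, asserts
// the iterations are independent, and lets the compiler pick the mapping
// onto teams and threads.
#include <cstdio>

int main() {
    const int n = 1 << 20;
    double* x = new double[n];
    for (int i = 0; i < n; ++i) x[i] = 1.0;

    #pragma omp target teams loop map(tofrom: x[0:n])
    for (int i = 0; i < n; ++i)
        x[i] = 2.0 * x[i];

    printf("x[0] = %f\n", x[0]);  // expect 2.0
    delete[] x;
    return 0;
}
```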
I just want to move on quickly now to an OpenMP case study that we did. This looks at a QCD mini-app called SU3, and the highlight of our case study is that we managed to achieve 97% of CUDA performance on an A100 GPU using OpenMP and the NVIDIA compiler. We actually presented this work at the GTC conference last year; it was a joint presentation between me and Guray Ozen, who is a compiler engineer at NVIDIA.
Our key performance plot was this throughput plot, which is the performance metric for this SU3 benchmark, and we show several bars. The first just shows the performance we obtained on the CPU, which was 139 GFLOP/s.
We went through successive code optimizations, as well as simplifications, and we managed to achieve this 97% of CUDA performance, which is a really nice achievement. I decided to choose this particular case study because it was one case study where the loop directive enabled us to obtain the highest performance with the NVIDIA compiler.
If you have data structures with lots of pointers and double pointers, you can move this data to the GPU with just a few directives. If you tried to do this with CUDA and a runtime API, you would need dozens and dozens of lines of code; it's very, very burdensome trying to do this with APIs.
There are other kinds of productivity wins we've seen. OpenMP provides directives to work-share loops between both teams and threads; if you're using CUDA, you kind of have to do this manually based on the thread ID and the block ID, which is just an additional burden, just a pain to have to deal with.
Another kind of productivity win is that you can very easily fuse loops using the collapse clause. This is really nice for exposing the parallelism required to make use of the massive parallelism on the GPU; if you're using CUDA, you have to manually fuse the loops and then do some crazy integer arithmetic to figure out the multidimensional index space.
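A minimal sketch of collapse over a doubly nested loop (illustrative):

```cpp
// Minimal sketch of the collapse clause: the two nested loops are fused
// into a single n*m iteration space, with no hand-rolled index arithmetic.
#include <cstdio>

int main() {
    const int n = 1024, m = 1024;
    static double a[1024][1024];  // static keeps the 8 MB array off the stack

    #pragma omp target teams distribute parallel for collapse(2) map(tofrom: a)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            a[i][j] = i + j;

    printf("a[1][2] = %f\n", a[1][2]);  // expect 3.0
    return 0;
}
```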
Another benefit of OpenMP is that you have a reduction abstraction. This makes it easy to perform data reductions; if you're using CUDA, you typically need a library in order to obtain a high-performance data reduction. So, in summary, there are multiple productivity wins from using something like OpenMP over CUDA.
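A minimal sketch of an offloaded reduction (illustrative; in CUDA one would typically reach for a library such as CUB for the same thing):

```cpp
// Minimal sketch of the reduction clause: OpenMP generates the
// high-performance parallel sum for you.
#include <cstdio>

int main() {
    const int n = 1 << 20;
    double* x = new double[n];
    for (int i = 0; i < n; ++i) x[i] = 1.0;

    double sum = 0.0;
    #pragma omp target teams distribute parallel for reduction(+: sum) \
        map(to: x[0:n]) map(tofrom: sum)
    for (int i = 0; i < n; ++i)
        sum += x[i];

    printf("sum = %f\n", sum);  // expect 1048576
    delete[] x;
    return 0;
}
```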
I want to just make a quick note about OpenACC. This is obviously an alternative directive-based approach; the concepts are very, very similar, with similar directives that occasionally have different names.
This is for kind of two reasons. The first is that, because it's a much more restrictive programming approach, you, the programmer, are forced to write much more GPU-friendly code; and the second is that, because it's a more restrictive programming approach, it's easier for the compiler to support the capabilities.
We showed in a paper at Supercomputing last year that a suite of NERSC OpenMP applications can achieve more than 90 percent of OpenACC performance, so there's really no reason to be concerned that OpenMP applications will perform poorly. At the same time, if you have an OpenACC application, there's no need for you to go out and quickly convert it to OpenMP.
But we are, of course, here to support you, and we're happy to engage and talk through any of the options that were presented today. We can also help you out if there's something you didn't see, or if you want to discuss its availability at NERSC, or what our recommendation is; we're happy to do that. I just wanted to end as well with a few call-outs: for those of you on the NERSC users Slack, there are a number of relevant Slack channels for the community of users out there who may also be writing code in the same model, so you can get in touch with people that way.
We're happy to have those discussions. And then, obviously, this was a very high-level overview, so keep an eye out for upcoming events that are targeting specific models and toolchains.