Description
Sunita Chandrasekaran of the University of Delaware presents a talk on Using OpenACC to accelerate scientific applications on GPUs. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/ Session Chair: Muaaz Awan
A: I don't believe I caught all of it, but following up on the question from the previous talk: directives can do only so much; programming languages, or handwritten code using CUDA, can do so much more. But that's the balance we have to strike, right?
A: So this talk is going to focus on OpenACC. First, I want to take a second to thank all the collaborators and the funders of the different projects going on in my group. And just a small plug: I noticed there are a lot of OpenMP talks on the agenda.
A: My group is also working with Oak Ridge National Lab on the OpenMP validation and verification test suite for offloading, covering 4.5 and 5.0 features, so do take a look at it. I have been writing test cases, and it's an open source project: pull requests are very welcome. So do take a look at it while we're talking about the offloading side of OpenMP as well.
A: All right, a quick recognition of all my lab members. Our group meetings have come down to Zoom meetings, so we see a lot of square boxes: the members of CRPL, my lab, the Computational Research and Programming Lab. Without them it would be very difficult to do much of this research, so a shout-out to each and every one of them.

A: I'm starting off by talking about the different ways to program, and if we were in the same room I would probably ask for a show of hands on your favorite programming model. Since we can't do that, I'm sure all of you are wondering why we see boxes of different sizes on the slide; that was unintentional.
A: There is no particular reason why the boxes are growing in size; I'm just trying to fit the text in. So we have libraries, we could program using abstractions, we could use directives, we could use programming frameworks, what have you. So it's a good thing that there is more than one way to program.
A: Of the two projects I'll show, one was a relatively easy project and one was not. I'm just trying to give you a flavor of the different directions directives can take, and of how directives are not magical: we have to do our fair share before expecting directives to do their magic, right?
A: So I believe this development cycle holds good for any programming model you choose: analysis, parallelization, optimization. You first take a code, you try to profile it, and you try to figure out which portions of the program you want to accelerate, and then you parallelize them. If you're happy with what you did, great; if you're not happy with what you did, you go back to re-profile it, then you optimize it, and then you re-analyze your code.
A: So that's obviously the kind of pattern we all seem to be taking.
A: Thank you, okay, so that's the cycle. Going forward, a little bit about the OpenACC programming model. Obviously I cannot fit all of it on one slide, but there are plenty of materials out there that you can go back and take a look at. In a nutshell, it's a directive-based programming model, and if you didn't know, it can also be used to run code on CPUs, just like OpenMP, which is why I did not say "accelerate code on GPUs" but rather "heterogeneous systems". That means you keep the code base as it is, and you retarget and recompile for a CPU as well as a GPU.
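As a minimal sketch of what that retargeting looks like in practice (the loop below is a toy example rather than code from the talk, and the PGI-style flags are quoted from memory, so verify them against your compiler's documentation):

    /* Toy OpenACC kernel: the directive is the only change to the code.
     * Retargeting is a recompile, not a rewrite (PGI-style flags, from
     * memory; check your compiler version):
     *   pgcc -acc -ta=multicore saxpy.c   # run parallel across CPU cores
     *   pgcc -acc -ta=tesla     saxpy.c   # offload to an NVIDIA GPU */
    void saxpy(int n, float a, const float *restrict x, float *restrict y)
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }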
A: PGI Community Editions are also available: licensed, yet free. I love this part about the compiler, because I can get my students to download the PGI OpenACC compiler onto their laptops in class when I'm teaching a course, and get them to program while we're talking. That's the wonderful part of the Community Editions. I believe the latest edition is 19.10, and the link points you to the Community Edition.
A: But it's easy to find if you just look it up online. So we're going to look at two projects: one is a biophysics project, the other is a solar physics project. For the biophysics project, I just wanted to give you a short story about where we started and how we ended. We started this project back in, I think, 2017 or 2018, with two students, Eric and Mauricio, the first two in the pictures, who were my undergrad students. We continued to work on the project, and they loved HPC.
A: They loved working on GPUs, they really enjoyed the project, and guess what: they turned around and said, "Hey, how about we pursue PhDs?" Who can say no to that, right? So they are now PhD students in my lab. This project is like a poster child to me, because it drove undergrad students to work on a GPU project and motivated them to look at bigger problems.
A: Robbie (Robert Searles), who was also my PhD student, and Alex, from the chemistry department at UD, served as mentors. Robbie graduated last year and is now with NVIDIA; Alex continues his PhD in chemistry. This is obviously an interdisciplinary project, with the domain science coming from chemistry with Professor Juan Perilla. And there's a timeline going on here: we did not do this in two weeks, and we did not do this in two months. You can see that it started in 2018 with students who had no background in parallel computing; they started from scratch.
A: They were learning directives on the fly, they were learning about profilers, they were learning to use the systems, and it scaled up: we ended up getting wonderful numbers and speedups, and we were able to publish a project that the undergrad students started. So I just wanted to emphasize here that it takes time, but if you're patient, you can get wonderful results.
A: And that's a perk of working with a domain scientist: you get cool graphics, which also rotate. How cool is that? So this is a biophysics project where the idea is to accelerate the prediction of chemical shifts of protein structures.
A: The code is from Ohio State University. We got a serial code, and the code had never seen a GPU before; it's a chemistry project happening at Ohio State. But then we figured out that this was the kind of function that is often used in large molecular dynamics packages, and it had been taking hours to run. The idea was: let's take a look at what's happening in the code; could we do something about it?
A: That's how we started. So only a serial version was available, and now a parallel version using OpenACC on GPUs is available. There is no CUDA version of the code, not yet. What we started with was serial profiling, and this was very important; I'm going to show you a little bit of why. No offense to any domain scientist on the call, but I do want to point out that scientific code is usually written algorithm-facing, right?
A: Scientific code is not typically written for parallel architectures, because you have an algorithmic mind: you're trying to get the equations in order and write code out of them. But when we jump in, we look at the code from a parallel, GPU standpoint, and we start saying "this data structure could have been designed the other way," and so on and so forth. So we spent about a month or two, maybe even more, just cleaning up the serial version of the code, using profilers and everything.
A: And if you look at this little portion of 23, 4, and 14 percent, you see the different slices in the pie chart. The very first goal was to get rid of badly written code that would not work well on a GPU architecture. Cleaning up the code, we were able to toss out the 23 percent spent in get_select, and we were able to restructure the code in a way that shifted the whole profile, and you can see that another major chunk stood out, which was the get_contact function.
A: So we started going in the order in which the percentages stood out; those were the compute-intensive hotspots, if you like, that we wanted to accelerate and go forward with. So we zoomed in to get_contact, and this is a piece of code using OpenACC. In the OpenACC implementation, you use the parallel directive to handle all the parallelization of the loops; then you have the enter and exit data directives, which manage device memory.
A: You also see a reduction clause to facilitate any kind of parallel scalar reduction, and there's text behind the figure that you see: basically, what goes into the outer loop and what goes into the inner loop. And you want to collect routines that don't necessarily need to be run within a loop for several iterations; rather, you can accumulate them. OpenACC also allows different levels of parallelism: gang, worker, and vector, three levels of parallelism. And a loop that does not benefit from parallelization you can mark as sequential, which is the last loop that you see, loop seq.
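Put together, the pattern described above looks roughly like the sketch below. This is illustrative only, not the actual get_contact source from PPM_One; contact_sum and pair_term are made-up names.

    // Illustrative sketch of the OpenACC pattern described above; not the
    // actual get_contact code from PPM_One, and the names are hypothetical.
    #pragma acc routine seq
    static double pair_term(double a, double b) { return a * b; }

    double contact_sum(const double *x, const double *y, int n, int m)
    {
        // enter/exit data directives manage device memory around the kernel
        #pragma acc enter data copyin(x[0:n], y[0:m])

        double total = 0.0;
        // gang/vector parallelism on the outer loop, with a scalar reduction
        #pragma acc parallel loop gang vector present(x, y) reduction(+:total)
        for (int i = 0; i < n; ++i) {
            double d = 0.0;
            // this inner loop does not benefit from parallelization: keep it sequential
            #pragma acc loop seq
            for (int j = 0; j < m; ++j)
                d += pair_term(x[i], y[j]);
            total += d;
        }

        #pragma acc exit data delete(x[0:n], y[0:m])
        return total;
    }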
A: And of course, this is an open source project and it's on GitHub; it's called PPM_One (ppm as in parts per million). Feel free to drop me a note; the entire code is available online to take a look at. I'm jumping into the results quickly, where we see that the serial, unoptimized version literally took about 14 hours for 11.3 million atoms.
A: We didn't even finish that run; we had to kill it because the system wasn't available for that long, but it was definitely 14 hours. And you can see the scalability from 100,000 atoms through 11.3 million atoms; from the left-hand side the bars are serial unoptimized, serial optimized, multicore, NVIDIA Pascal, and NVIDIA Volta V100.
A: So imagine: the serial version was running for 14 hours, and this was a function that a bigger molecular dynamics package was calling some number of times. So if the routine was called, say, 100 times, that's 14 times 100 hours you're going to be running, as opposed to running that routine for 47 seconds if you accelerate it on a GPU.

A: And this code was made possible because of OpenACC and the availability of its implementation on GPUs, by undergrad students who were able to learn the whole thing and apply it. And like I said, it was close to a two-year project by the time we published. The paper is open access, so feel free to download it and read through; there is a ton more material in it. It's in PLOS Computational Biology.
A: If you look up PPM and University of Delaware and OpenACC and GPU, the paper will pop up. So this is a case study that was smooth, I would say; we didn't run into hiccups, just the usual re-profiling and optimization hiccups, nothing major. The next one I want to take you through is a solar physics application, because I want to tell you that it's not always a hunky-dory process: you earn the results through the effort that you put in. So we put in enough effort,
A: and we got that cool speedup in the biophysics project. So this is a solar physics project. Eric Wright, my PhD student, is working on this; he was also one of the students working on the previous project. And this is in collaboration with Dr. Rich Loft and his team at NCAR, as well as the Max Planck Institute for Solar System Research in Germany.
A
This
was
this
is
a
tough
nut
to
crack.
This
has
been
a
an
interesting
project
and
it
is.
It
has
not
been
smooth,
but
I
do
want
to
tell
you
you
know
how
how
we
have
been
approaching
this
problem,
because,
given
a
domain
science
problem,
it's
we
all
know
does
not
matrix
multiplication
or
a
jacobi
iteration
right,
there's
a
lot
more
going
on
in
the
code.
So
the
the
the
crux
of
this
picture
that
you're
seeing
is
basically
a
a
300
million
dollar
nsf
investment
into
the
telescope.
A
You
know
recently
they
had
invested
so
much
money
and
the
telescope
is
hosted
in
hawaii
and
it's
generating
tons
and
tons
of
data
and
the
data
needs
to
be
processed
in
the
same
or
problem,
hopefully
in
the
speed
at
which
the
data
is
being
received.
So
it's
kind
of
a
big
data
problem
and
muram
is
the
code's
name
which
is
max
blank
university
of
chicago
radiative,
magneto
hydrodynamics,
that's
what
the
mhd
stands
for,
and
the
kernel
of
interest
to
encar
in
this
particular
problem
is
the
radiation
transport,
the
rt.
A
You
know
problem
of
this
muram
code,
so
this
is
probably
you
nurse
or
lbl.
Has
you
know
a
lot
of
tutorials
and
webinars
and
there's
also
a
roofline
analysis
coming
up
on
july
8th
some
of
my
students
have
registered,
so
I'm
just
throwing
in
a
bunch
of
you
know,
tools
that
we
have
used
so
that
you're
aware
these
tools
exist
and
you
can
look
up
them
to
learn
more
about
it.
A: So we used nvprof, CUPTI, and the occupancy calculator, that Excel sheet, which can tell you so much about the GPU that it's just awesome. And we loved the PCAST tool; we have used it extensively. It's also a PGI tool, and it lets you compare code on the CPU and the GPU, which is so important when you're trying to figure out where the bug is, where the error is.
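A minimal sketch of how the autocompare flavor of PCAST is driven; the compile flag and environment variable are quoted from recollection of PGI 19.x, so check the PGI documentation for your release:

    /* Minimal PCAST autocompare sketch (flag and variable names from
     * recollection of PGI 19.x; verify against the PGI docs).
     * Compile so each compute region runs on both CPU and GPU, with the
     * results compared automatically:
     *   pgcc -acc -ta=tesla:autocompare pcast_demo.c
     * Tune reporting and tolerance at run time, e.g.:
     *   PGI_COMPARE=summary,rel=8 ./a.out */
    #include <stdio.h>
    #define N 1000

    int main(void)
    {
        float x[N], y[N];
        for (int i = 0; i < N; ++i) x[i] = (float)i;

        #pragma acc parallel loop copyin(x) copyout(y)
        for (int i = 0; i < N; ++i)
            y[i] = 2.0f * x[i] + 1.0f;   /* outputs checked against the CPU run */

        printf("y[%d] = %f\n", N - 1, y[N - 1]);
        return 0;
    }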
A: So these are some of the tools we have used, and the screenshots are from the project. And what stands out? As you can see, the top three rows are the compute-intensive rows. This is again on a single core, just in seconds, paring the problem down to the bare bones: TVD, magnetohydrodynamics (MHD), and the radiation transport (RT).
A: NCAR is interested in accelerating the RT portion of the code, but if you have profiled and looked at a decent-sized code, you would know that as you profile and clean up, and profile and clean up, the profile shifts: you could merge some kernels, you could split some kernels up, you could move things around when you can and when the dependencies are sorted out. So the numbers that you see are not going to stay the same once you optimize and re-profile; but of interest are the top three rows.
A: Oops, sorry, yes. So what you see is nvprof: the ones on the left-hand side are the profiler pictures. Stepping through the profiler picture, you see the different colors for the different slabs of radiation transport, and how nvprof really tells you where the computation and the data management are going on. And that's a gist of the code: how many lines of code, and the tools we used. We are still debugging a problem, and I could talk for hours about the software engineering.
A: We also created a Jupyter notebook to figure out where the issues are, and you see some discrepancies between our code and the ground truth. We needed that, and PCAST helped a lot here. I think these slides are available, but that's just the experimental setup.
A: So what you see is, again, the speedup of RT on an NVIDIA Volta V100 versus a single core and versus a full node of 32 cores. There is a lot of room for improvement, but that's where we are at the moment. When we ran into the bug, the solar physicists were not happy, so we had to fix the bug first before we moved on, which makes total sense. So this was a tough problem, right? It is still a tough problem.
A
We
don't
have
numbers
to
show
off
here,
like
the
biophysics
code,
so
I
want
to
wrap
up
by
saying
that
directives
are
not
magical.
It's
incremental
improvement
is
what
gets
you
to
where
you
want
to
go
to,
but
the
the
fun
part
of
using
openacc
was
to
be
able
to
recompile
and
retarget
in
the
biophysics
code.
We
have
not
changed
a
single
client
in
the
source
code
and
it
runs
on
the
multi-core
and
the
gpu.
A
I
mean
the
idea
of
using
directives
is
to
be
able
to
maintain
yeah
contact
me
if
you
have
any
questions,
and
I
wanted
to
stop
with
this
slide,
which
has
info
on
gpu,
hackathons
and
boot
camps,
and
if
you
go
to
that
particular
website,
you
have
tons
of
more
information
on
the
gpu
hackathons,
that's
ongoing
over
the
period
of
time.
B: Thank you very much, Sunita. This was very inspiring, considering that you've achieved all this using directives. So we have a question here: did you have any issues with branching while porting the radiation code to the GPU?
A: Yes, we did; that's an excellent question. We did: we had to restructure a kernel because of a branching issue. As we know, branching is detrimental to GPU performance, and this is a radiation transport code, so you can imagine how the waves travel from one corner to the other corner. And I think I can suspect where this question is coming from, or who's asking it.
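As a hypothetical illustration of the kind of restructuring this takes (the functions and index arrays below are placeholders, not MURaM code): one common fix is to partition the work by branch outcome on the host, so that each GPU loop is uniform and branch-free.

    /* Hypothetical sketch of reducing branch divergence; not MURaM code. */
    #pragma acc routine seq
    static double slow_path(double v) { return v * v; }     /* placeholder */
    #pragma acc routine seq
    static double fast_path(double v) { return v + 1.0; }   /* placeholder */

    /* Divergent version: neighboring iterations disagree on the branch,
     * so a warp serializes both sides. */
    void divergent(const double *in, const int *flag, double *out, int n)
    {
        #pragma acc parallel loop copyin(in[0:n], flag[0:n]) copyout(out[0:n])
        for (int i = 0; i < n; ++i)
            out[i] = flag[i] ? slow_path(in[i]) : fast_path(in[i]);
    }

    /* Restructured version: idx_a/idx_b are built on the host, one list
     * per branch outcome, so each loop below is branch-free. */
    void uniform(const double *in, double *out, int n,
                 const int *idx_a, int n_a, const int *idx_b, int n_b)
    {
        #pragma acc parallel loop copyin(in[0:n], idx_a[0:n_a]) copy(out[0:n])
        for (int k = 0; k < n_a; ++k)
            out[idx_a[k]] = slow_path(in[idx_a[k]]);

        #pragma acc parallel loop copyin(in[0:n], idx_b[0:n_b]) copy(out[0:n])
        for (int k = 0; k < n_b; ++k)
            out[idx_b[k]] = fast_path(in[idx_b[k]]);
    }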
A: Hi, Ron. So, did performance improve on the CPU? No, it did not. I was actually glad that we did not lose performance on the CPU, and I have the biophysics code in mind here, because the solar physics code has a ton more work to do. We didn't lose performance, but I wouldn't say we necessarily saw a performance improvement. But that's a very interesting question.
B: Yeah, that's a good question, so that's good to know. I think it's time we move on to the next speaker. Thank you very much, Sunita. It was very interesting to hear about all the interesting work your team is doing.