From YouTube: Current State of CRAY GPU Compiler
Description
John Levesque (HPE)
Current State of CRAY GPU Compiler
What I want to do is show you the use of our tools and compilers to move an application to a GPU. Everyone has talked about putting in do concurrent, putting in target directives, or whatever, but the question is: where do I put them? So over the years, first with the Titan work at Oak Ridge and now with the Frontier work at Oak Ridge and El Capitan at Livermore, we have developed a suite of tools to really help users move to the GPUs.
Okay. So this, once again like the previous speaker's, is really a three-hour talk, and I'm just going to hit the highlights.
First off, we have an excellent performance analysis tool called perftools, and Apprentice is a GUI for looking at things like timelines, call chains, MPI communication, etc.
The other thing, which has recently been extended to generate OpenMP offload, is Reveal. Now, I would not use Reveal on a C++ code; I would use it on Fortran. It is really a scoping tool, and I'm going to show the use of Reveal in this presentation. Now, contrary to where NVIDIA is going these days, I love directive-based programming models.
The problem is you're always going to have to use some directives, and, you know, they're portable. There are a number of vendors that support both OpenMP and OpenACC. We support both, and we generate code for AMD and NVIDIA, so there really is portability.
So what I'm going to show is just some perftools output from a very simple code called Himeno; I don't want to do a complicated one, because we're really short on time. To use perftools, all you do is load a perftools-lite module, build the application, and run the application, and the results come out.
The -rm flag here is to get an annotated listing to go with the statistics. So here are my statistics: in Himeno there's one routine that uses all the time. It does a halo exchange, and that's where the real problem is going to hit you when you move to the GPU. It also breaks the time down into the maximum function times (this was run with two MPI tasks per node), load imbalance, etc.
The other thing is that it shows you, at the line level, where all the time is being spent. So if I now look at that annotated listing, I see that all the time is being spent right here: there is an iteration loop, and then a triple-nested loop with a reduction. This is the Jacobi relaxation.
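
For context, the loop nest perftools is pointing at has roughly this shape. This is a minimal, self-contained sketch of a Jacobi sweep with a residual reduction; the names kmax, jmax, imax, p, wrk2, and gosa follow the talk, but the stencil body is illustrative, not the actual Himeno source.

```fortran
subroutine jacobi_sweep(p, wrk2, imax, jmax, kmax, omega, gosa)
  implicit none
  integer, intent(in)  :: imax, jmax, kmax
  real,    intent(in)  :: p(imax, jmax, kmax), omega
  real,    intent(out) :: wrk2(imax, jmax, kmax), gosa
  integer :: i, j, k
  real    :: s0, ss

  gosa = 0.0
  do k = 2, kmax - 1
     do j = 2, jmax - 1
        do i = 2, imax - 1
           ! 7-point stencil: the i-1 / i+1 references are what let the
           ! compiler unroll this loop for cache reuse (noted below)
           s0 = ( p(i-1,j,k) + p(i+1,j,k) + p(i,j-1,k) + p(i,j+1,k) &
                + p(i,j,k-1) + p(i,j,k+1) ) / 6.0
           ss = s0 - p(i,j,k)
           gosa = gosa + ss*ss              ! the reduction
           wrk2(i,j,k) = p(i,j,k) + omega*ss
        end do
     end do
  end do
end subroutine jacobi_sweep
```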
Okay, so now the big question is: what are the loop iterations? Because in order to really figure out how best to put this on the accelerator, I want to know what kmax, jmax, and imax are.
So we have another thing, called perftools-lite-loops, and what this is going to do is get us the loop iteration counts. Now I'm displaying the call tree, where this is the main program that calls Jacobi. (I think that's an error; I don't think Jacobi uses over a hundred percent of the time.) Then here's our looping structure, and it shows us the average loop trip count. And then this is the initialization, and the MPI down here.
Okay, so in the annotated listing you also notice that it tells me the compiler vectorized this loop and unrolled it by three. The reason it unrolled it is all of these i, i-1, and i+1 references: you're really effectively utilizing cache that way.
Okay, and now what we're going to do is just use Reveal to generate an OpenMP code. So we bring up Reveal and select a loop. Not that outer loop on iterations; we don't want to do that. It's not parallelizable, because it's iterating, and it even calls an MPI routine here. So we drop down, choose the k loop, and bring it up.
Another window comes up, and it asks: do you want to scope for a GPU or scope for a CPU? I do CPU initially, and then it comes back and tells me all of the variables that are scoped private and shared. This one here means there's a reduction, but there are no inhibitors to this loop, and so I just insert directives. And look at that: here are the directives. You notice it always uses default(none); this is a good way to do it.
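
The directive Reveal inserts for the CPU case looks roughly like the following. This is a hedged sketch around the same loop nest as above, not Reveal's verbatim output; the point is the explicit scoping that default(none) forces.

```fortran
!$omp parallel do default(none)                   &
!$omp    shared(p, wrk2, imax, jmax, kmax, omega) &
!$omp    private(i, j, k, s0, ss)                 &
!$omp    reduction(+:gosa)
do k = 2, kmax - 1
   do j = 2, jmax - 1
      do i = 2, imax - 1
         s0 = ( p(i-1,j,k) + p(i+1,j,k) + p(i,j-1,k) + p(i,j+1,k) &
              + p(i,j,k-1) + p(i,j,k+1) ) / 6.0
         ss = s0 - p(i,j,k)
         gosa = gosa + ss*ss
         wrk2(i,j,k) = p(i,j,k) + omega*ss
      end do
   end do
end do
!$omp end parallel do
```

With default(none), any variable that isn't explicitly scoped becomes a compile-time error instead of a silent data race, which is why it's a good habit.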
So that's OpenMP, but now I want to generate GPU code. All right, so I go back and I choose GPU, and now there are several things going on here. Notice the G; the G is very important. Basically, what that's saying is that those arrays are in a module, so we're going to have to take special extra care to make sure the arrays are updated prior to returning, because those arrays may be used in other places. When you have a module or a common block, you're going to have problems like this with both OpenACC and OpenMP. Now, unified memory gets you away from this problem, so one thing we're doing is making sure we generate extremely good unified-memory code, so that you don't have to deal with this global-variable issue.
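
As a point of reference, OpenMP 5.x lets a program assert this directly. A minimal sketch, assuming a compiler and GPU that actually support unified memory:

```fortran
! Declared once in the specification part of a program unit; it tells
! the implementation that host and device share one address space, so
! module/global arrays no longer need explicit map/update traffic.
!$omp requires unified_shared_memory
```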
Okay, so basically these are the OpenMP offload directives that were generated. Now the problem is that there are some map(always, ...) clauses, and that is because of this global nature. In other words, it's always going to have to write some arrays back to the host, and copy one array to and from the host.
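
Put together, the generated offload construct looks something like the following. This is a sketch of the shape, with illustrative map clauses rather than Reveal's exact output; the point is how the module arrays force map(always, ...) traffic on every invocation.

```fortran
! p is copied to and from the host, and wrk2 is written back, every
! single time the construct is entered, because the compiler cannot
! prove the module arrays are untouched elsewhere.
!$omp target teams distribute parallel do collapse(3)   &
!$omp    map(always, tofrom: p) map(always, from: wrk2) &
!$omp    private(s0, ss) reduction(+:gosa)
do k = 2, kmax - 1
   do j = 2, jmax - 1
      do i = 2, imax - 1
         s0 = ( p(i-1,j,k) + p(i+1,j,k) + p(i,j-1,k) + p(i,j+1,k) &
              + p(i,j,k-1) + p(i,j,k+1) ) / 6.0
         ss = s0 - p(i,j,k)
         gosa = gosa + ss*ss
         wrk2(i,j,k) = p(i,j,k) + omega*ss
      end do
   end do
end do
```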
Okay, so now we're going to use perftools in order to profile the GPU. Now, perftools is not as automatic as the lite version: you have to first do a pat_build, then run the instrumented version, and then do a pat_report. But this gives you excellent information.
Well, there's a loop in between, so yes, p is set equal to wrk2, and then there's an allreduce. But we want to look at the halo exchange, because we cannot afford the message passing around this halo exchange. Okay, so what I'm going to move to (I'm sorry, I can't find the version of this code where I did the halo exchange in Himeno) is Leslie3D. Oh, before I go there, I want to tell you the other thing I did, which was to put all the kernels that use the variables accessed in that loop onto the accelerator: I want everything to be on the accelerator. This is the thing that really reduced the data movement. Okay. So when you are packing buffers on the accelerator, you typically have this.
This is kind of the north face: pulling that part out of the 3D array, and then sending the north to my neighbor. To put that on the accelerator is very easy: you just do a target, and then I'm updating the host with the buffer. Now, there are corresponding receives.
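
Concretely, the typical pattern looks something like this. A hedged sketch for the north face; the buffer, rank, and tag names are illustrative, and p and north_buf are assumed to be mapped in an enclosing target data region.

```fortran
! Pack the north face of the 3D array into a contiguous buffer on the
! accelerator, pull the buffer back to the host, then hand it to MPI.
!$omp target teams distribute parallel do collapse(2)
do k = 1, kmax
   do i = 1, imax
      north_buf(i,k) = p(i, jmax-1, k)
   end do
end do
!$omp target update from(north_buf)
call MPI_Send(north_buf, imax*kmax, MPI_REAL, north_rank, tag, &
              MPI_COMM_WORLD, ierr)
```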
And, excuse me, the receives come back to the host... I'm sorry, and then you do this; I'm sorry, this one is on the accelerator too. Well, anyway, you know what I mean. Okay, but there is something that's much, much nicer, and that is using device arrays or device pointers.
So what I've done is I've made all of the buffers device pointers, and now I do not have to do any updates. All I do is use those pack and unpack loops on the accelerator, and then I use the device pointer, and it turns out that now the data doesn't even go to the host. It goes directly from the accelerator to the NIC, and so this really improves your halo exchanges; and the same thing goes for the receive buffers.
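
With device-resident buffers and a GPU-aware MPI, the same exchange drops the host round trip entirely. A sketch, assuming north_buf is already mapped on the device and the MPI library accepts device addresses:

```fortran
! Pack on the accelerator as before...
!$omp target teams distribute parallel do collapse(2)
do k = 1, kmax
   do i = 1, imax
      north_buf(i,k) = p(i, jmax-1, k)
   end do
end do
! ...but with no target update: pass the device address straight to
! MPI, so the transfer goes accelerator -> NIC without the host copy.
!$omp target data use_device_addr(north_buf)
call MPI_Send(north_buf, imax*kmax, MPI_REAL, north_rank, tag, &
              MPI_COMM_WORLD, ierr)
!$omp end target data
```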
Okay, now, I actually ran this on Piz Daint in Lugano, and using device buffers doesn't really help you that much on small numbers of nodes. But when you get to large numbers of nodes, you really get a significant increase in speed, because you're cutting down all of that overhead of transferring the data back and forth to the host.
Now, the other thing I wanted to show you is that we have this debug environment. One of the developers, who is an awesome, awesome coder, put in his own debug. This was way back in the days of Titan: I was porting S3D with OpenACC to Titan, I needed something to help me debug the code, and he said: hey, I have this CRAY_ACC_DEBUG.
There are three levels, one, two, and three, selected with the CRAY_ACC_DEBUG environment variable. Level one shows you all of the transfers and all of the kernel executions: where they come from, which routine and which line number, and then the syncs. So this is just, you know, transferring 29 items, etc. And now notice that it says it transfers that, but there are zero bytes transferred. That is because the analysis said to do the transfer, but the compiler recognized that those arrays were already resident where they should be.
This is level two, and now it gives you the actual array name and the size of the array that's being transferred. So you have this coming out on every rank, and if you hit a problem on the accelerator, you know exactly where it is, because the output stops, and you can see that it probably died in a kernel.
It shows you exactly where that is. Now, level three is the kitchen sink, which is extremely useful, because it really tells you everything: it tells you the size, it gives you the host and the accelerator pointers, and it gives you the striding.
The compiler keeps a present table, and in the present table is the region of data that represents each data element. This is why you can equivalence variables and the runtime will still recognize that the data might, in fact, still be on the accelerator. This is just so valuable; it's extremely useful.
Perftools is excellent for identifying issues in existing applications and for improving threading, vectorization, and scalar optimization. I've been using it for 20 years; I am the biggest user.
If you ever have any problems with perftools, ask me. Reveal can really help with scoping variables. In Himeno it didn't even need any help from the user, but on something more complicated it's going to come back and tell you: I don't know what to do with this variable. Then the user can say: I want to make that private, or it should be shared. Now, I don't recommend using Reveal on C++.
There are just too many issues with C++. Now, moving to the GPU is difficult; however, you can do it in steps that are more manageable, and perftools identifies the bottlenecks for you very quickly. And finally, GPU-direct is the best way to do message passing, and this is even getting better, because on Frontier the MPI can actually run on the accelerator, so you don't even have to have the host invoke the MPI call.
There are three cases in that which I don't have time to go into, because I'm out of time, and that is my last line.