From YouTube: 4 Accelerating code with Directives
Description
Part 4 from the Parallelware Trainer Tool workshop at NERSC on June 6, 2019. Slides are available at https://www.nersc.gov/users/training/events/parallelware-tool-workshop-june-6-2019/.
Okay, so you have learned by examples. Those of you who had never seen OpenMP or OpenACC before are now familiar with the syntax, with some of the fundamental pragmas and some of the fundamental clauses. But there are some additional pragmas and clauses that you can use during the practicals. So here I will try to introduce the additional pragmas and clauses that you will be able to use in the practicals.
Okay, parallel: really, nothing else to explain. Remember, parallel is where you define the parallel region. Until that point, single-threaded code; after the parallel region, single-threaded code again. So at the beginning of the parallel region, threads are created and all of them work in parallel; at the end of the parallel region, all of them are destroyed except for one, which continues the single-threaded execution. Okay. So this is essentially what we are explaining here, and this is the syntax in C/C++, so I think it doesn't make sense to stop here any longer.
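To make the lifecycle above concrete, here is a minimal C sketch of a parallel region (the helper name `run_region` and the counter are ours, not from the slides; if the code is compiled without OpenMP support, the pragmas are simply ignored and one thread executes the block):

```c
/* Counts how many threads executed the parallel region. */
int region_executions = 0;

void run_region(void) {
    /* Single-threaded code runs up to this point. A team of threads
       is created here, and every thread executes the block below:
       without work-sharing, the body is replicated, not divided. */
    #pragma omp parallel
    {
        /* Atomic update so concurrent threads do not lose counts. */
        #pragma omp atomic
        region_executions++;
    }
    /* The team is destroyed here; one thread continues alone. */
}
```

After calling `run_region()`, the counter equals the number of threads in the team (one per thread), which illustrates that the block is executed once by each thread.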
What pragma would give me more performance, kernels or parallel? In some sense, both of them specify the beginning and the end of a parallel region. So what's the difference? The difference is that with parallel, the one we have been using and the one that I will really suggest or recommend you to use, you, the developers, are responsible for using the pragmas and the clauses correctly, following best practices. If we don't use them properly, the parallel code that we create will be incorrect. Okay, it's our responsibility.
Kernels is an attempt by OpenACC to release the programmer from that responsibility. So who discovers the parallelism, if it's not the programmer using a pattern-based approach or a classical approach of trying and testing different parallel implementations? Who does the discovery of parallelism here? Who implements, who generates the parallel code for us? There is only one piece of software that can do this: the OpenACC compiler. Okay, the OpenACC compiler is a compiler like any other, and any compiler has some capabilities to discover parallelism in real code.
But all of these things that we use in every single code defeat it: they make the classical dependence analysis and data-flow analysis ineffective. It doesn't work to discover this parallelism. Okay, but again, if that technology somehow improved, then kernels would be a way for us to get rid of the responsibility of finding and discovering the parallelism and implementing the parallel version. But the reality is that the state-of-the-art compilers today are not very effective at discovering parallelism in real code.
Okay, indeed, in the example that we have in the practical you can check kernels, and you will see that no compiler can discover the parallelism that parallel, with our approach, can discover, because we are using a completely different way of discovering parallelism. Okay, it is the intellectual property of the company that we incorporated five years ago.
So, but anyway, it's important to know that this exists, so that at some point you can even try and test the difference in performance between the parallel pragma, using the pattern-based approach, and kernels, to see how far a compiler can get in doing this job for us. Okay, indeed, some of the practicals we proposed can explore this part of comparing the performance that you get with the parallel and with the kernels directives. Okay, so it's good that you know that this exists, and this is the syntax of the various pragmas.
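To show the contrast between the two directives, here is a minimal sketch using a saxpy loop (the saxpy example and function names are our choice, not from the slides; without an OpenACC compiler the pragmas are ignored and both versions run sequentially, which keeps the sketch portable):

```c
/* With `parallel loop` the programmer asserts: this loop is parallel.
   Correctness is the programmer's responsibility. */
void saxpy_parallel(int n, float a, const float *x, float *y) {
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* With `kernels` the compiler must analyze the region itself and
   parallelize only what it can prove safe; if its dependence
   analysis fails, the loop stays sequential. */
void saxpy_kernels(int n, float a, const float *x, float *y) {
    #pragma acc kernels
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Both compute the same result; the difference is who takes responsibility for discovering the parallelism.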
If we don't specify work-sharing, then our loop iterations are replicated in each of the threads: ten iterations on one thread means ten iterations in the execution; ten iterations on two threads means ten plus ten, twenty iterations executed; on twenty threads, twenty times ten iterations. But we don't want the replication of our code. When we go to a parallel execution, we want to divide the workload among the threads, not to multiply the workload by the number of threads. Okay, so work-sharing is essential to specify parallelism and to really have a parallel version.
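The replication-versus-division point can be sketched like this (the helper names are ours; compiled without OpenMP, both run on one thread and return `n`):

```c
/* No work-sharing: every thread in the team executes all n
   iterations, so the total work is n times the number of threads. */
int run_replicated(int n) {
    int count = 0;
    #pragma omp parallel
    {
        for (int i = 0; i < n; i++) {
            #pragma omp atomic
            count++;
        }
    }
    return count;  /* n * number_of_threads */
}

/* Work-sharing: the n iterations are divided among the threads,
   so they are executed exactly n times in total. */
int run_shared(int n) {
    int count = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp atomic
        count++;
    }
    return count;  /* exactly n */
}
```

With ten threads, `run_replicated(10)` does a hundred increments while `run_shared(10)` does ten: division of the workload, not multiplication.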
Okay, three levels of parallelism. Work-sharing on the CPU is simple to understand. I mean, one thread begins the execution, it finds the parallel region, it creates ten threads; I have fifty iterations, so fifty iterations divided between ten threads means each thread is assigned five iterations, and all of the threads can communicate and synchronize with all of the other threads. So the execution model of the multi-threaded CPU is simple to use.
Okay, but that is not the case for the GPU. Remember, when we began this morning, we said that the GPU has a complex memory design, a hierarchy of memories. In the CPU you have the main memory and the cache; in the GPU you have the main memory, the shared memory, the cache, the scratchpads, different types of memory, and not all threads can access all of the memories. That's the main difference with a multi-threaded CPU. So there are restrictions that are imposed by the hardware.
A
So
how
do
we
as
programmers
can
lead
with
this
complexity?
Ok,
open,
MP
and
open,
as
you
see,
provide
a
way
to
handle
this.
That
is
when
you
do
work
sharing.
You
can
specify
work
sharing
at
three
levels.
Let's
call
it
generically
coarse
grain,
fine
grain
scene,
director
called
grain,
means
that
when
all
the
threads
are
created,
imagine
100
threads.
These
threads
are
grouped
by
groups.
Imagine
that
the
groups
are
of
50,
so
two
groups
of
50.
What this means is that each group has a representative thread that can communicate with the representative of the other group, but not with the other threads of the other group. Okay, so each of these gang threads can communicate with its workers, using OpenACC terminology, and can communicate with other gangs, but not with the workers of other gangs.
Okay, so this is the GPU execution model, and the OpenACC and OpenMP execution models for GPUs provide this functionality to somehow simplify the control of how the threads are grouped on the GPU automatically by the hardware. Okay, so in OpenACC we have a clause that is called gang. What gang means is that when you specify a reduction, you can say: I want to make a reduction at the gang level.
What this means is that the gangs, each of the groups, will collaborate, will communicate with each other to make the reduction of all the local partial results computed in each of the gangs. So you can do a reduction between gangs. If you specify a reduction at the worker level, you will not get the correct result that you expect. Why? Because at the worker level, you will have the workers within this gang making a reduction, and the workers within that gang making a reduction, but the gangs will not communicate to make the final reduction.
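A minimal sketch of a gang-level reduction in OpenACC (the function name is ours; without an OpenACC compiler the pragma is ignored and the loop runs sequentially, still producing the correct sum):

```c
/* Sum reduction at the gang level: each gang accumulates a partial
   sum, and the reduction clause makes the gangs combine those
   partial sums at the end. Workers of one gang cannot communicate
   with workers of another gang, so this cross-gang combination is
   exactly what the reduction clause has to provide. */
double sum_gang(int n, const double *v) {
    double total = 0.0;
    #pragma acc parallel loop gang reduction(+:total)
    for (int i = 0; i < n; i++)
        total += v[i];
    return total;
}
```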
The workers will not use atomicity and mutual exclusion across gangs to make the reduction correctly, okay. So even these levels of parallelism, which you usually use to optimize the performance of your application on the GPU, can lead to incorrect code that produces incorrect numerical results. For instance, reduction operations: reduction operations are defined in OpenACC to work only at the gang level, not at the worker level, not at the vector level. In OpenMP we have an equivalent. We have again three levels: the coarse-grained level is specified by teams distribute, and the worker level is specified by parallel for.
Okay, and an additional level, the vector level, is specified by simd. Within each of these, the threads are somehow tied to each other so that all of them are used in the different lanes of the vector hardware. This happens on the GPU using vector, and this also happens on the CPU if you take multiple threads and you vectorize some inner loops within the multi-threaded code. Okay, so that is the importance of the three levels of parallelism for the GPU.
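The three OpenMP levels can be combined in a single directive, as in this sketch (the function name is ours; when OpenMP offload is not enabled, the pragma is ignored and the loop runs sequentially):

```c
/* The three OpenMP levels in one construct:
   - teams distribute  -> coarse grain (like OpenACC gang)
   - parallel for      -> fine grain   (like OpenACC worker)
   - simd              -> vector lanes (like OpenACC vector)
   The map clause transfers v to the device and back. */
void scale(int n, double a, double *v) {
    #pragma omp target teams distribute parallel for simd map(tofrom: v[0:n])
    for (int i = 0; i < n; i++)
        v[i] *= a;
}
```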
Just to summarize: you need to remember that this exists. You need to remember that when you are doing, for instance, reduction operations, you can only make reductions at the gang level. If you do it at lower levels, the result will not be as expected, because all the threads cannot communicate with the rest of the threads; the result will be incorrect. Okay, so this has an impact on performance but, more importantly, even on correctness. So we need to be aware of this.
Any questions on all this? Just be aware of this when we move on to the practical. Okay, atomic: we have already seen atomic. We have atomic available for C/C++ in OpenMP, and we also have atomic available in OpenACC. So, the parallel loop with atomic protection to implement reductions: we can use that strategy to execute reduction operations in parallel on the GPU, and atomic operations on the GPU are now extremely effective. There has been a great improvement in the hardware support; some years ago they were very costly.
You can now expect good performance: right now the atomic operations of the GPU are really highly optimized. So this is something that we can use to create parallel code on the GPU using atomic protection. Okay, so you can do it and you can play with it in the practicals.
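A minimal sketch of the parallel-loop-with-atomic-protection strategy (the example and function name are ours; compiled without OpenMP, the pragmas are ignored and the loop runs sequentially with the same result):

```c
/* Reduction implemented with atomic protection instead of a
   reduction clause: each update of the shared accumulator is an
   atomic operation, so concurrent threads cannot lose updates. */
long count_even(int n, const int *v) {
    long count = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (v[i] % 2 == 0) {
            #pragma omp atomic
            count++;
        }
    }
    return count;
}
```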
Target and data: we have already seen this. Remember that when you go to the GPU, the GPU execution model of offloading is host-driven. The host starts the execution, and at some point it decides that part of the code is offloaded to the GPU. The code to be executed is sent, but we also need to send the data that is needed to make the computations. This is done with data in OpenACC and with target data in OpenMP, and then we need ways to control the data transfers from the CPU memory to the GPU memory. This is copyin in OpenACC and map(to) in OpenMP, to copy data in.
Once the result is computed on the GPU, we transfer the data back to the CPU so that we can see the output; remember that the execution is host-driven. So, copyout, or map(from). And there is data that you will probably want to copy both in and out for some reason in your application, so copy, or map(tofrom). Okay, and you can see that Parallelware Trainer generates copyin, copyout, copy and map clauses for you to handle the data.
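The host-driven model with explicit data movement looks roughly like this in OpenMP (the function name is ours; `map(to:)` corresponds to OpenACC copyin and `map(from:)` to copyout; without OpenMP offload the pragmas are ignored and the loop runs on the host):

```c
/* Host-driven offload with explicit data movement: the input array
   is copied to the device before the region (map(to:) / OpenACC
   copyin), and the result is copied back after it (map(from:) /
   OpenACC copyout). */
void square_all(int n, const double *in, double *out) {
    #pragma omp target data map(to: in[0:n]) map(from: out[0:n])
    {
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; i++)
            out[i] = in[i] * in[i];
    }
}
```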
Imagine that you have an array of one million elements, and you have a loop that processes 1000 of those elements, and you want to offload those computations to the GPU. Would you transfer the one million elements to the GPU? No, you would want to transfer only the region of the array that is really used during the computation.
So you need to specify somehow that, from an array of one million elements, only these 1000 elements, starting here and ending here, are what needs to be transferred from the CPU memory to the GPU memory. Okay, this is what we call array shaping. We have shaping for 1D arrays, 2D arrays, 3D arrays, multi-dimensional arrays, and the way we specify this uses the same syntax that we use to allocate arrays statically in memory.
You see, we can write float x[1000], and this statically allocates an array in memory. In Fortran you can also create arrays using, I think it is, the parentheses notation. Okay, so essentially it is the same notation, and you specify where the elements start and how many elements you want to transfer. Okay, and this is essential to minimize the data transfers from the CPU to the GPU and back from the GPU to the CPU.
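A minimal sketch of array shaping in OpenACC (the function name is ours; in C the section is written `start:count`; without an OpenACC compiler the pragma is ignored and the loop runs sequentially):

```c
/* Array shaping: from a possibly huge array, transfer only the
   `count` elements beginning at `start`, not the whole array.
   copyin(data[start:count]) uses the start:count section notation. */
double sum_section(const double *data, int start, int count) {
    double total = 0.0;
    #pragma acc parallel loop reduction(+:total) copyin(data[start:count])
    for (int i = start; i < start + count; i++)
        total += data[i];
    return total;
}
```

So for the example above, an array of one million elements with only 1000 used would be shaped as `copyin(data[0:1000])` instead of transferring the full array.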
Finally, only two slides remaining. Remember that we said that we want to learn how to parallelize real code, and your code has loops that call routines. If we cannot parallelize a loop that contains a call to a routine, we're in trouble. Okay, so we need somehow a directive where we can mark which routines need to be executed on the GPU.
Remember that when you compile code for the CPU, the binary code runs on the CPU architecture, but the GPU has a different architecture, so we need to compile a different binary version of that routine to be run on the GPU as well. So how do we specify this in OpenMP and OpenACC? If my loop, as you will see in the LULESHmk practical, calls functions, I need to say: this function and this function will be offloaded to the GPU; please, compiler, generate a binary version to be executed on the CPU and also another binary version to be executed on the GPU, because both are needed: when the code is not offloaded, all the code is executed on the CPU. So this is what routine does in OpenACC, and this is what declare target does in OpenMP.
So imagine that you have a fully parallel loop, a parallel for, that calls foo. You need to specify that foo will be offloaded, so that the binary is generated for the GPU.
So in the declaration of the foo function, before the signature, you have #pragma acc routine, and here you have several modifiers; we will use seq only for these practicals. With this, you guarantee that the compiler will generate a version of foo that will run on the GPU whenever it is needed to offload it, okay.
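Putting that together, a minimal sketch (`foo` is the name used on the slide; the body of `foo` and the wrapper `apply_foo` are our invention; without an OpenACC compiler the pragmas are ignored and everything runs on the CPU):

```c
/* `routine seq` tells the compiler to also emit a sequential device
   version of foo, so the offloaded loop below is allowed to call it.
   Both a CPU binary and a GPU binary of foo are generated. */
#pragma acc routine seq
double foo(double x) {
    return 2.0 * x + 1.0;  /* placeholder body for the sketch */
}

/* A fully parallel loop that calls the routine. */
void apply_foo(int n, double *v) {
    #pragma acc parallel loop copy(v[0:n])
    for (int i = 0; i < n; i++)
        v[i] = foo(v[i]);
}
```

In OpenMP the equivalent would be to enclose foo's declaration in declare target / end declare target.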
Something that people usually do is inline the routine. What this means is that, instead of using routine seq, you take all the body of the routine and you replace the call with the body of the function. Can you do this? Yes, you can. Do you avoid the call to the routine? Yes, you avoid it, but you make your code less structured. You are going against writing structured code, the well-structured code that makes your code maintainable. So it's better practice to use this directive instead of inlining the routine wherever it is called. Okay, you will have to play with this in the LULESHmk practical. And finally, those of you who work with C and C++ know restrict and const.
They are not always explicitly needed to generate parallel versions, parallel code, but some compilers may request that, for some of the arguments that you have in the signature of a function, when there are pointers, you explicitly mark that those pointers cannot alias one another: that the regions of memory that can be accessed by dereferencing those pointers cannot overlap. Okay, so those of you who are familiar with pointers in C and C++...
...you will probably find restrict useful. In Fortran you usually don't address these issues, because in Fortran, by default, when you allocate arrays they are allocated in separate memory regions that cannot overlap, and apart from that, when you use pointers, you have a restricted implementation of pointers that is not as powerful as this one. But again, this makes programming easier, and writing compilers easier, in some sense. Okay, so this is essentially something that you will probably need for C and C++ real codes.
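A minimal sketch of the two qualifiers together (the function name is ours; the key point is the promise the signature makes to the compiler):

```c
/* `restrict` promises the compiler that a, b and c do not alias:
   the memory regions reachable through them cannot overlap.
   `const` promises that the inputs are read-only. Both promises
   make it easier for the compiler to vectorize or parallelize
   the loop safely. */
void vec_add(int n, const double *restrict a,
             const double *restrict b, double *restrict c) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```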
Okay, so we have omitted the heat practical from this course here, but I think you have the heat example included in the participants' materials that Helen has shared with all of you. So again, heat has some combination of patterns and loops, so you can play with all of the patterns that you have seen here. But for this afternoon I will recommend that, instead of doing the heat practical, you really go into the complexity of the LULESHmk practical.
A
So
we
can
help
you
to
understand
the
complexity
of
paralyzing
real
codes
and
how
the
composition
in
components
and
the
composition
impetus
can
really
help
you
to
understand
how
to
paralyze
real
codes,
even
if
it
is
the
first
time
that
you
see
the
code,
you
need
to
understand
the
science
behind
the
code.
You
will
need
to
find
properties
in
the
code
and
these
properties
are
remarket
captured
by
the
patterns
themselves.
So
that's
what
you
really
need
to
change
in
your
menses.
Okay,.