From YouTube: 3 Parallel Patterns
Description
Part 3 from the Parallelware Trainer Tool workshop at NERSC on June 6, 2019. Slides are available at https://www.nersc.gov/users/training/events/parallelware-tool-workshop-june-6-2019/.
The break is finished, so we need to continue with the agenda of this Parallelware Trainer tool workshop. Before the break we had a really great session: we had so many questions and so much interaction from you that I really hope we keep up this great learning atmosphere. Essentially, what we introduced, as you remember, were the very basic concepts.

These are the minimum concepts we need to understand what we are doing when we go to the GPU. We saw the key differences between the CPU and the GPU, and we also saw examples of what OpenACC and OpenMP code looks like for a multi-threaded CPU and for a GPU. We did this without formally introducing the semantics of the pragmas; we somehow learned and got used to the syntax through the examples we showed in the demonstration of the tool, using the pi calculation example.

I'm very happy, because we also covered many of the features that are available in the tool, and we also discussed many interaction issues that might appear with the file system, with other tools, with your workflow. So it was really a great session. Now we are going to jump into the new key concepts, the distinguishing part of this training: learning the families of parallel patterns that we are considering right now, learning what different parallelization strategies we have for each of these patterns, and where these strategies can be applied — CPUs, GPUs, multi-threading, offloading, tasking, in OpenMP and OpenACC.

So first I would like to begin by thinking about what OpenMP and OpenACC really do for us. These are texts extracted from the OpenACC specification, and I really like to highlight this part, because more or less everything is a consequence of it: programmers need to be very careful that the program uses appropriate synchronization.

What this means in practice is that OpenMP and OpenACC don't guarantee that the code is correct. It is us as programmers who have to guarantee that the pragmas and clauses we add to the code will behave correctly in parallel. OpenMP and OpenACC provide great support from the compiler, which does all the hardest work of generating the calls to the multi-threading library of the operating system.
That is the hard part of the work, and the compiler automates it, but we are still responsible for selecting and writing the appropriate pragmas, the appropriate clauses and the appropriate options to those clauses. If we make a mistake in one of them, the program may be incorrect. And if the program is incorrect, don't blame the compiler, don't blame the standard, don't blame the machine: it is we who wrote incorrect code. Okay, so OpenMP and OpenACC, as you see, make programmers responsible for making good use of the pragmas and their related clauses.

So we need to learn best practices for how to use the pragmas and the clauses. We have reference guides, quick-start guides, great tutorials that explain in a lot of detail the semantics of each pragma, each clause, each option, and what the differences are between C, C++ and Fortran and between different compilers. But what is really missing in all that material is how we relate them: how do we combine all of these bricks in the best way for our code? They can be combined in many different ways — which is the best way, and why? This is what the patterns will give us: the knowledge to decide. They guide us in the decision of which pragmas and clauses we should use and how we should combine them for our specific code. Okay, so the decomposition of the application into patterns will help us make good use of OpenMP and OpenACC and will speed up the parallelization process.

This is important because, if we don't invest time in understanding this, the alternative approach is trial and error. I write "parallel for", build, run — it runs correctly, so I think I'm good, my code is correct. But maybe it is only correct for that run; maybe another run produces incorrect results, or maybe I change my input data to a different problem and my program, which was correct for a given data set, now behaves incorrectly. And again it is my responsibility to do it correctly.

So it really does speed up the parallelization process, because otherwise we invest a lot of time trying to debug, trying to find the bugs in incorrectly written parallel code and fix them. The question is: can we avoid making the most common mistakes when we first add pragmas and clauses, save that development time and use it for other purposes? Can we do that? The patterns will help you with that, and the patterns are also based on best practices for parallel programming.

The parallelization strategies that we are supporting right now, which we will see in these slides, are based on the analysis of, for instance, the CORAL benchmarks. We have analyzed all the implementations of the CORAL benchmarks. We published a scientific paper — it can be downloaded; it was presented at Supercomputing two years ago — where we analyzed all of the implementations together with collaborators from national laboratories, and we came to the conclusion that the three implementations we are supporting right now are the most widely used implementations of the patterns we support.
Can you implement them in different ways? Yes, you can, but we tried to imitate and promote the practice of what expert developers did when they developed the CORAL benchmarks, and to incorporate that knowledge into a tool so that you can learn from it and apply it to your codes. Okay, so based on that, it is likely that if you choose the right implementation for a given pattern, you will get good performance. I don't say peak performance, because peak performance is always very hard, but good performance.

We said this morning that real codes are large and complex, so we need to approach real codes from a different perspective. Let's use components for that, and let me talk in terms of components. We start from the serial code, and we need to analyze it component by component. By components I mean different types of components. One type is scientific components: we always try to avoid reinventing the wheel.

So we just have to link them when building the executable, and the executable will make very efficient use of these scientific components. Okay, but that is not enough. If my problem were only computing one FFT, my problem would be solved just by calling the FFT, but usually an FFT is only a step within a more complex simulation: I compute the FFT, I take the result, I compute a matrix multiplication, I take the result, and somehow I manipulate these results to compute my output. These are just steps in big scientific applications.

So at some point in our code, what we need to do for our science will not be available as a call to a scientific library, and we will need to analyze real code. We need to write and understand the code: how it uses all the outputs from the scientific components, and how all these outputs, all these variables, are combined to produce my result. Okay, so I cannot escape from analyzing what we call code components, or patterns: the actual code that we write.

Even if there are libraries highly optimized for a system that you can use, you will still find sections of code that are not available in the libraries. Those are the ones you will need to understand in terms of parallelism — to discover the parallelism in the code — and turn into a parallel implementation. For that, once you identify the pattern, the pattern will guide you to create the parallel version.

We saw that with the pi example: a scalar reduction can be parallelized as a parallel scalar reduction, and we can implement this parallel scalar reduction in many different ways — we saw at least four or five implementations in the demo before the break. So how many parallel implementations can I have? As many as you can imagine, just combining all the elements that OpenMP and OpenACC give you. Okay, so for the code patterns you will need to generate parallel code; that is the final step.
But here you have many possibilities that you need to compare in order to select the best one. So how does this process fit in the workflow we saw for OpenMP? Remember that we said that, for real codes, we need to profile first. Why? Because if I have one million lines of code, it doesn't make sense to start at line number one and go up to line number one million; it is better to focus on the part that consumes most of the execution time, especially if I go to the GPU.

You remember that we said you need to minimize data transfers, but you also need a significant workload to offload to the GPU, to take advantage of the beast. The GPU has a huge computational power, so you need big problem sizes to feed that beast so that it can really compute fast for you. Okay, so begin with the hotspots.

Then we said that we had these two steps: find the hotspots, those parts of the code that need to be analyzed for parallelism; decide how to implement them in parallel; and make the actual parallel implementation. It is in these two steps that the pattern / parallel pattern / parallel code flow fits into the overall workflow. So you will work iteratively within the general workflow, working on different loops, incrementally adding more and more parallelism to your code.

Components, patterns, patterns translated into parallel code — more or less we have already said all this, but I want to summarize it in a set of four steps. First, we are talking about taking a real application, not the toy examples that we use in training: an application that is part of your science. First, even if you have never done it, do at least one profiling run, just to double-check that you are focusing on the right part.
Second, for each routine contained in an external library, what do you have to do? Remember that you want to run your code on a given platform. You may be using a generic library that was compiled for a laptop and that can be ported to Cori, but probably on Cori we have installed a highly optimized version of the same library that you are using on your laptop.

So you need to consider identifying the scientific components that you have in your code — the FFTs, matrix multiplications, solvers, spectral methods — and consider using the highly optimized versions that you have on a given system. That way you can take advantage of all the work that the staff of the center has done for you. Okay, and that leads to step number three.

Third, you just have to be aware of whether you are coding a routine that is already available as a library and you were not aware of it, or you had decided not to use it, because you need to take a decision: do I want to keep using my own code, or do I want to replace this piece of code with one single library call that is highly tuned for the system? Okay, it's up to you.

What is better for your science or for your code, given your expertise? But you need to be aware of it and make the appropriate decisions. Okay, so in that case you can consider replacing the corresponding routines with optimized library calls available on the system where you are going to run. And finally, for the remaining user-defined routines — the routines that you have coded yourself as a developer — you need to address the complex process of parallelizing your code. For this, what we propose is to decompose your code into components, in particular into code components, into patterns, and to use what we will see next as a guide to generate different parallel versions, so that you can pick the one that performs best, fastest, on a given architecture. Okay, and this is the final step, number four: for the remaining user-defined routines, understand the code compute patterns that you have in your code. Okay.
In the Parallelware technology we have probably eight, nine, ten patterns, but some of them are rarely found in scientific codes; if we have them it is because they appear in some domains, but they are not of general use. So what we have done with Parallelware Trainer is to provide support for those code patterns that are most widely used, and we use this terminology: parallel forall, parallel scalar reduction, and so on.

The parallel forall is what intuition says a forall is: typically a loop where all the iterations can be executed concurrently, in parallel, in any order; you don't need to worry about dependencies or about the ordering of iterations. Okay, so this is what parallel forall means. We can easily represent it as a loop where each iteration produces a new value that is stored in a different element of an array. This is typical in scientific and numerical computation.
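For reference, here is a minimal sketch of what a forall loop can look like (the function and variable names are illustrative, not taken from the slides): every iteration writes a different element of the output array, so the iterations are independent and can run in any order.

    /* Forall pattern: iteration i only touches out[i]. */
    void scale(int n, const double *in, double *out, double factor) {
        for (int i = 0; i < n; i++) {
            out[i] = factor * in[i];
        }
    }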
Okay, so this is the typical, simple code that you will find when you recognize a parallel forall. The next pattern is the scalar reduction. Now you have a loop where all the iterations compute a value — but what do we do with those values? We don't produce different, independent output values in each iteration; what we do is reduce them all to one single value, using a sum operator, a multiplication operator, a minimum or maximum computation. There are very well known types of reduction operations. So this is typically represented like this: we have the values, as we have here.
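A hedged sketch of the pattern, using the same pi calculation discussed in the session (the exact code on the slides may differ slightly): every iteration computes a contribution and all of them are combined into the single scalar sum.

    /* Scalar reduction pattern: sum is the reduction variable. */
    double compute_pi(long n) {
        double dx = 1.0 / (double)n;
        double sum = 0.0;
        for (long i = 0; i < n; i++) {
            double x = (i + 0.5) * dx;     /* declared inside the loop body */
            sum += 4.0 / (1.0 + x * x);    /* every iteration adds into sum */
        }
        return sum * dx;
    }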
Instead of producing a different independent element in each iteration, we reduce them all through this reduction operator. Okay. The sparse reduction is a reduction again — we have a set of values and these values are reduced — but instead of reducing all of them to one single scalar value, we reduce them to a set of values.

Okay, so why is it called sparse? Because which elements are updated depends on something that we don't know until runtime. Do you have experience with finite element codes, with molecular dynamics codes? It is typical in finite elements or molecular dynamics to find this type of code. You iterate over elements, or you iterate over molecules, and what you do is compute the interaction between one molecule or one element and its neighbors, so you need to add that contribution for the list of neighbors: neighboring finite elements or neighboring molecules.

So how do you represent in your code the list of neighbors of a given molecule, or the neighbors of a given finite element? You typically use an indirection array, which is represented here as C. Okay, so these sparse reductions can in general be parallelized using strategies similar to the ones we use for scalar reductions, but we need to take something additional into account: the result is not a single value, it is a set of values, and we only know which elements will be updated at runtime.
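A minimal sketch of a sparse reduction, assuming an indirection array neighbor[] such as the neighbor list of a molecular dynamics or finite element code (all names are illustrative): which elements of force[] are updated is only known at runtime, through the values stored in neighbor[].

    /* Sparse reduction pattern: A[C[i]] += ... through an indirection array. */
    void accumulate(int n, const int *neighbor, const double *contrib, double *force) {
        for (int i = 0; i < n; i++) {
            force[neighbor[i]] += contrib[i];
        }
    }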
So the question is: how can we handle this in the parallelization strategy? Okay, we can do it, and we will see how. Essentially this is the use case that we are proposing in the hands-on practical that you will do after lunch, playing with it and learning how to parallelize sparse reductions, which appear in many scientific domains. And finally, we have added this last pattern recently, because we found some use cases where it appears. Essentially we have the forall computation, but now there is an indirection: every single iteration produces a different value, but two iterations can eventually compute the value of the same element — they can collide, they can conflict in producing the same element of the array. It depends on the values of C: if C is a permutation, there is no conflict; if it is not a permutation, there are potential conflicts at runtime.
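A hedged sketch of this sparse forall pattern (again with illustrative names): each iteration assigns — rather than accumulates into — an element selected through the indirection array C, so conflicts are possible whenever C is not a permutation.

    /* Sparse forall pattern: plain assignment through an indirection array. */
    void scatter(int n, const int *C, const double *val, double *A) {
        for (int i = 0; i < n; i++) {
            A[C[i]] = val[i];   /* assignment, not a reduction */
        }
    }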
Okay, so these are the four patterns that we support and recognize in the Parallelware tool, and now we will see how we can parallelize them. Just as a reminder, to reinforce the learning: a parallel forall is typically a loop that updates all of the elements of an array; typically each iteration updates a different element of the array, and the result of the computation of this pattern is an array, which is called the output. Okay. So how do we parallelize this? It is a parallel loop.

You don't need to worry about the order of the iterations; you can reorder them in the most convenient way for your purposes, because they will never have race conditions — the parallel behavior is correct by construction for a parallel loop. For the scalar reduction, what you are doing is computing multiple values and reducing them into one single value, which is called the scalar reduction variable. Important here: you cannot use just any operator.

You need to use an operator that fulfills two mathematical properties, commutativity and associativity, because this is what enables reordering. Two plus three equals three plus two mathematically, but not necessarily computationally. Okay, so you need to guarantee that the operator fulfills these two properties in order to compute the scalar reduction in parallel.
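A small illustration of that last point (not from the slides, just illustrative numbers): floating-point addition is associative mathematically but not computationally, which is why a parallel reduction that reorders the additions can produce slightly different results than the serial loop.

    #include <stdio.h>

    int main(void) {
        double a = 1.0e16, b = -1.0e16, c = 1.0;
        printf("%g\n", (a + b) + c);   /* 1 on typical IEEE-754 hardware */
        printf("%g\n", a + (b + c));   /* 0: b + c rounds back to -1e16  */
        return 0;
    }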
This is typically coded as a loop, and the result of the pattern is what is called a reduction variable, or output reduction variable. Here we have three different ways of parallelizing it, which we saw in the demonstration before the break. Essentially it is a parallel loop — the same way of coding a forall — but adding additional synchronization. Do you remember what the OpenMP and OpenACC standards say? As we saw at the beginning, the programmer is responsible for adding appropriate synchronization to guarantee correctness.

So we can generate many versions, many parallel implementations, of the same code. The next one is the sparse reduction. Remember that the key distinguishing features are that the output is an array and that it has a sparse nature — the unpredictability of the values of the indirection. So a sparse reduction combines a set of values into a set of values, again using a commutative and associative operator, but using a vector, an array, as the output, not a single scalar variable. Okay, and the set of array elements that will be updated cannot be determined until runtime. Why?

Because only at runtime do we typically know the neighbors of the molecules in a highly dynamic molecular simulation, or the neighbors of a finite element in an adaptive finite element code that changes the connections and refines the mesh. Okay, of course there are some problems where this indirection array may keep fixed values for the whole execution, and that opens different optimization opportunities, but in any case you can parallelize the sparse reduction using the same strategies. And, as always, all the code patterns have an output.

And finally the sparse forall. I will not stop long on it, because the way it behaves requires different ways of enforcing synchronization, but from the point of view of its description it is very similar to the sparse reduction: it updates the elements of an array, the set of array elements cannot be predicted at compile time — it is only known when you execute the application, for the input dataset of that particular run — and again you have an output variable that is an array. Okay.
So, if we are able to take our code, our loops, our hotspots, and characterize the loops in terms of these patterns, what do we gain? What are the benefits? Okay: patterns enable us to ensure correct variable management in the parallel code. What this means is that when you use OpenMP and OpenACC capabilities, you create the parallel region, you do the work sharing, but you have additional clauses where you have to specify, for all the variables in the code, what to do with them. Will you make them private?

Will you share them? Will you reduce them across the threads? So you need to remember that you have to specify how to manage all the variables that are read or written in your code. The patterns characterize the computations for a given variable, so they give you the information you need in order to decide the correct way to manage that variable — in particular the one that is the output of the pattern. Okay. The patterns also provide algorithmic rules to recode sequential code into a parallel equivalent.

Once we know we have a reduction, we know the statements of the code that update the reduction variable. I know that I can forget about the rest of the code from the point of view of that variable; I just need to protect the concurrent accesses in the statement that updates it. So for that variable I know how to manage it appropriately, with additional synchronization, so that the parallel execution of the code is correct.

Okay, so the pattern provides the algorithmic rules to generate parallel code, and that is why Parallelware Trainer can do it for us. Also, each pattern has a set of policies and strategies that can be applied to it, so it also supports generating different parallel versions of one single sequential code, using different standards and different hardware platforms. We saw in the dialog that we could choose OpenMP or OpenACC, GPU or CPU, multi-threading, offloading or tasking paradigms. All the combinations of these are all the parallel versions that you can generate.
These are the strategies that we have for each of the patterns — the ones that we have implemented in Parallelware Trainer — and they are inspired by best practices in parallel programming, for instance through the analysis of the CORAL benchmarks. What you can see here is that for a given pattern, the forall pattern, whether you run on the CPU or you offload to the GPU, you have one unique strategy available, which is the parallel loop.

This is simply because you don't need any additional synchronization to guarantee correctness. For the scalar reduction you can see that we have three implementations on the multi-threaded CPU and only two implementations for offloading to the GPU. In both cases we can use the built-in reduction support of OpenMP and OpenACC, or we can use the atomic protection that we achieve by protecting the statement that updates the variable; but on the CPU you can also use the explicit privatization implementation, which is not available for the GPU.

The reason for this — and we will see it next, when we cover explicit privatization — is that this strategy creates a private copy of the variable for each of the threads. On the GPU you typically have thousands of threads, so creating private copies may incur a lot of memory overhead that may make your program inefficient, or make it crash because it runs out of memory; strange things can happen in your code. And if that is already a risk for scalars, for sparse reductions it is much worse, because the private copy is a whole array.

So we can easily end up running out of memory. Best practices therefore don't recommend using explicit privatization for the sparse reduction on the GPU, and that is the reason why it is not implemented or supported in the Parallelware tool. Okay, and in the case of the sparse forall, we are working on having the explicit privatization strategy available on the CPU; it is not applicable to the GPU.

I will not go into the details of that, and the other strategies are also not applicable to the sparse forall, because the way you need to combine the partial results of the threads into the final result needs special synchronization and additional computation that is not valid on the GPU. Okay. So this gives you a kind of summary table of all the possibilities that you can create with OpenMP and OpenACC.

Remember that we don't say OpenMP or OpenACC anywhere here; we say multi-threaded on the CPU and offloading to the GPU, because all of these combinations can be implemented using either OpenMP or OpenACC. Okay, so you have many possible implementations to generate and to test on your code.
Okay, so let's go into the details. What we have not yet defined in detail is how these parallelization strategies actually behave, so we need to reinforce and learn exactly how they work. Let's begin with the parallelization strategy parallel loop. This one is trivial: if a parallel forall is found, for instance in this code, this is the code generated with Parallelware Trainer, using OpenMP or OpenACC, for CPU or GPU.

What you can see here is that in each iteration you compute different values that are stored in different memory locations, in different array elements; so you have a forall pattern, and you can parallelize it by just defining the parallel region. The first thing you have to remember, for all the patterns, when building the implementation, is where the parallel region begins and ends. If you are focusing on the analysis of loops, typically the parallel region begins right before the loop header and ends right after the end of the loop.

You need to say, implicitly or explicitly, whether each variable will be shared among the threads, will be private to each thread, or will be reduced — all the private local values of the threads combined into one single value at the end. So you need to specify every single variable. In the OpenMP implementation of this loop, in this version, we force as a best practice for learning the clause default(none). What default(none) means in OpenMP and OpenACC is that the compiler will fail to compile your code if you don't specify every variable used here in either a shared, a private or a reduction clause. Okay, there are some additional variants, like firstprivate; we won't go into that detail. What you have to remember is that, for all the variables that are used, you are forcing yourself to specify them explicitly here.

So in this case, all the variables that are read-only are shared, but the array that is the output can also be shared among the threads. Shared means that all the threads can access it concurrently, but every iteration accesses a different element; you never have two different threads accessing the same element at the same time. That situation cannot occur, because the parallel forall pattern guarantees that it cannot occur. If the analysis in terms of patterns is correct, it is safe to just create a parallel region.
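As a reference, here is a hedged sketch of what this parallel-loop strategy can look like for a forall loop like the one sketched earlier (the code generated by the tool for the slide's example may differ in details): a parallel region around the loop, default(none) forcing explicit data scoping for every variable, the output array shared, and the work sharing controlled with a schedule clause.

    /* Parallel-loop strategy for a forall: no extra synchronization needed. */
    void scale_parallel(int n, const double *in, double *out, double factor) {
        #pragma omp parallel default(none) shared(n, in, out, factor)
        {
            #pragma omp for schedule(auto)
            for (int i = 0; i < n; i++) {
                out[i] = factor * in[i];   /* each thread writes its own elements */
            }
        }
    }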
You can schedule the iterations of the loop in whatever order is best for your code, run it in parallel, and the code will always be correct. Okay, this is the power of the parallel forall pattern. Additionally, here, again for the sake of promoting and helping learning, we add the schedule clause to the for work-sharing construct. What this means is that you have several options to map the iterations of the loop, in different orders, onto the threads. You all raised your hands when I asked whether you had written MPI code, so think of it this way: essentially you can do a block distribution of the iterations among the threads, a cyclic distribution of the iterations among the threads — cyclic by one, cyclic by two — or a distribution of blocks of iterations among the threads. The same concepts that apply to data distributions in MPI implementations are applied here to specify the way the iterations of the loop are mapped, are assigned, to the threads within the current parallel region.

Okay, so we just put auto here; this delegates choosing the right schedule to the compiler, but you can edit it and write static, static,1, dynamic or runtime — four or five options that you can easily change to run different experiments with your code. Okay.

So, in terms of concepts, with a parallel loop you specify where the parallel region begins and ends; you specify, for all the variables, whether they are shared, private or reduction — in particular, the output array of the parallel loop can be shared, because there is a guarantee that no race condition will appear; and you also play with the work-sharing construct and modify its default behavior using the schedule clause.
Great, so let's move on to the first type of synchronization that we need to add to this fully parallel loop, to this parallel forall: the parallel loop with built-in reduction. Again, here we have a code that we already know, the computation of pi, the same example we used before the break. Here the reduction variable is sum, which collects the final result of summing all the values produced while evaluating the iterations.

This expression is evaluated for different values of i, and i is the loop index. So in general we have a reduction, a scalar reduction: each iteration produces a different value, and at the end of the loop we want to reduce them all — with a sum, in this case — to one single final value. Okay, so again the loop is characterized by a parallel scalar reduction. How do we translate this into parallel code?

Again, definition of the parallel region: it begins and ends at the limits of the loop, right before and right after the loop. Again, with default(none) we force the specification of all the variables that are used in the loop. Note that if you declare some variables inside the loop, there is no need to specify them in the clauses of the parallel region, because a variable declared inside is not visible outside.

It is automatically local to the thread that has been assigned that iteration. Okay, so even the way we declare the variables in our code can help us make the OpenMP or OpenACC implementation simpler: if we declare x outside of the loop, we need to add x here as a private variable; if we declare it inside, the code is still correct.
So, as a rule of thumb — there are only a few variables here, but you will find many in big codes — all the variables that are not written, that are only read, need to be shared. Okay. Only those variables that are written need to be managed somehow with additional synchronization. In this case, the only variable that is written, apart from x, which is declared within the loop body, is the variable sum, and the pattern is telling us that sum is the reduction variable.

So we know that, in order to parallelize it, the additional synchronization we need to add is the clause reduction(+:sum). This instructs the compiler to generate the synchronization needed to perform the reduction of these values. Okay. So: shared variables, again the work sharing with for and schedule(auto), and with that we have covered the reduction and the variables that are written — in particular the variable that is the output of the pattern; we know exactly what we need to do with it.
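Putting those pieces together, here is a hedged sketch of the built-in-reduction strategy for the pi loop (a minimal version along the lines described here, not a copy of the generated code): default(none) forces explicit scoping, sum is handled by the reduction clause, and x needs no clause because it is declared inside the loop body.

    /* Built-in reduction strategy: the compiler synchronizes the partial sums. */
    double compute_pi_reduction(long n) {
        double dx = 1.0 / (double)n;
        double sum = 0.0;
        #pragma omp parallel default(none) shared(n, dx) reduction(+: sum)
        {
            #pragma omp for schedule(auto)
            for (long i = 0; i < n; i++) {
                double x = (i + 0.5) * dx;
                sum += 4.0 / (1.0 + x * x);
            }
        }
        return sum * dx;
    }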
In modern C you can declare variables inside the loop like this; in Fortran you have many more restrictions. Do you typically write code in Fortran or in C? In C, C99 allows you to do it, so with anything from C99 onwards you should be able to do this with no problem. In Fortran, depending on the Fortran flavor that you use, you may be allowed to do this or you may be forced to declare all the variables at the beginning of the procedure, in which case you are forced to manage the data scoping of those variables in the clauses of the OpenMP pragma.

What we do not control here is exactly how, and in which order, this set of partial values is reduced. That is up to the compiler, which will perform the reduction operations in the order that is optimal for a given platform. Okay, so we delegate to the compiler the order in which all the values produced by the threads are actually combined, and we don't worry about it: the compiler will do a good job at that and will guarantee that the result is correct.
Let me say this in a different way. You have to guarantee that when the threads access the shared variable s they do it in an exclusive way, so that while one thread is reading the value, adding its contribution and storing the result back into the same shared variable, no other thread can interrupt that process. If you don't, what can happen is that you get an incorrect final value of the sum, because mutual exclusion — atomicity — has not been guaranteed.

Okay, so this is how the atomic strategy works. All the threads access the shared variable, so during the execution of the parallel loop, concurrently, you can have thread zero reading the value of s, adding a value and storing the result back in the same location; at the same time thread one can be doing the same, and thread two as well. So what do you need to do?

What you need to do is protect this plus-equals operation with atomic. Atomic means that whenever thread zero is doing this plus-equals operation, the rest of the threads wait until thread zero finishes it; when it finishes, it keeps on working, and then another thread is granted access to the mutual exclusion section, so that it, again without interference from other threads, computes its plus-equals operation. Okay, so this guarantees the atomicity of the plus-equals operation. If we don't guarantee this, the result in general will be incorrect.
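A sketch of the atomic-protection strategy applied to the same pi loop (again a minimal, hedged version rather than the exact generated code): the reduction variable sum stays shared, and every update of it is protected with an atomic construct.

    /* Atomic-protection strategy: one atomic operation per iteration. */
    double compute_pi_atomic(long n) {
        double dx = 1.0 / (double)n;
        double sum = 0.0;
        #pragma omp parallel default(none) shared(n, dx, sum)
        {
            #pragma omp for schedule(auto)
            for (long i = 0; i < n; i++) {
                double x = (i + 0.5) * dx;
                #pragma omp atomic update
                sum += 4.0 / (1.0 + x * x);
            }
        }
        return sum * dx;
    }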
So in parallel, what we have is all the threads, in the different iterations, doing different plus-equals operations on the same shared variable — thousands of them — so we need to execute thousands of atomic instructions to protect the update. Intuitively, you can see that there is a lot of additional synchronization being added with this strategy, because every single plus-equals operation needs to be atomically protected. Okay, and the number of atomic operations that you issue is proportional to the problem size.

If you have one thousand iterations, one thousand atomics; twenty million iterations, twenty million atomics. So the parallelization overhead, the amount of synchronization, grows as you raise the problem size — and when you raise the problem size it is usually because you want to go to the GPU — so you will need to find a balance, a trade-off, when you use this strategy on the CPU or on the GPU. Okay? Is the concept of mutual exclusion clear? Okay. So, from the point of view of the implementation, we repeat the same series of things.

Definition of the parallel region enclosing the loop that we have analyzed in terms of code patterns; shared variables — all the read-only variables are shared; again the work sharing — all the loop iterations are distributed among the threads according to a schedule, auto to delegate it to the compiler, or we can specify static, static,1, dynamic or runtime.

Okay, so you can see that there are many implementation steps that are common to all the patterns: particularly the definition of the parallel region, the rules to determine which variables are shared, the work sharing, and also how to add the synchronization to protect those parts of the code that are sensitive in the parallel execution — and we have all of that information by using these patterns. Okay.
Now we have a slightly different scenario. We still have the shared memory and the shared variable, so all the threads can access the shared variable at any time, but now we call this explicit privatization, because what we do is create a copy of the shared variable in each of the threads. So each thread has its private data, its private copy of the same shared data: if the shared data is a scalar, each thread has a private scalar.

Okay, so this is the key difference between them, because we are removing all the atomic protection that was needed during the computation: the threads work completely independently from the rest of the threads, with no synchronization. Where is the parallelization overhead? In the amount of memory. Where is the parallelization overhead of the atomic-protection strategy? In the atomic operations — and there you don't incur any additional memory.

So, in terms of parallelization overhead, we always need to find a trade-off between additional synchronization, with atomics or mutual exclusion, and additional memory, to create private copies of variables that decouple the parallel execution of the different threads. Having these two things in mind, and finding the right balance for your code, is how you can really create a very efficient parallel implementation; and privatization is one of the most effective ways to implement scalable parallel code in real applications.

So, during the computation, no atomic protection is needed. But of course, at the end each thread has its own private copy, a partial sum of the final result, so we need to do something else that was not needed with atomic protection. What do we need to do? Each thread contributes its private, local result to the shared copy in shared memory: the final private partial result is summed into the shared memory.
Here we do need atomic protection, but in this case we only need as many atomic operations as the number of threads that we have, in contrast to the problem size. You can have one billion iterations and four threads: four atomics. In the other approach, with one billion iterations you have one billion atomics for those same four threads. Okay, so it is always a trade-off between the amount of memory used to reduce or remove synchronization, and the minimum amount of synchronization that you need to guarantee correctness.

Okay, so with explicit privatization, by explicitly creating private copies of the original variable for each thread, you decouple the execution; you do the computation very fast and very efficiently, and you only need to synchronize at the very end. And here is what you can do with Parallelware Trainer — this is again the same pi example. Before going to the case of arrays, let us start with the simple case of a scalar. We have the same pi example that you already know, so what has happened here?
The tool has created the parallel region, has created the work sharing, and now, instead of adding atomic protection or using the reduction clause, it has created a preamble before the loop that creates a private copy for each thread. In the loop, the uses of the original shared variable have been replaced with uses of the thread-local variable, so here all the threads work independently, with no synchronization at all.

Once they finish, each thread, using atomic protection, updates the final shared variable with its private result, in a postamble. So explicit privatization takes the original loop and creates three stages in the parallel implementation: the same loop, with the original uses of the reduction variable replaced by the private copy; a preamble to declare the private copy and initialize it; and a postamble to reduce — to compute the final result from the thread-local partial results computed by each thread. This is what you see here: preamble, main body, and postamble. Okay.
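A hedged sketch of that three-stage structure for the pi loop (illustrative names; the Trainer's generated code may differ in details): the preamble creates a thread-local copy, the main loop accumulates into it with no synchronization, and the postamble adds each thread's partial result into the shared variable under atomic protection — one atomic per thread.

    /* Explicit privatization strategy: preamble, main body, postamble. */
    double compute_pi_privatized(long n) {
        double dx = 1.0 / (double)n;
        double sum = 0.0;
        #pragma omp parallel default(none) shared(n, dx, sum)
        {
            double sum_private = 0.0;               /* preamble: private copy */
            #pragma omp for schedule(auto)
            for (long i = 0; i < n; i++) {
                double x = (i + 0.5) * dx;
                sum_private += 4.0 / (1.0 + x * x); /* no synchronization here */
            }
            #pragma omp atomic update
            sum += sum_private;                     /* postamble: one atomic per thread */
        }
        return sum * dx;
    }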
This can be applied to scalars, but it can also be applied to arrays, and this is probably what you will be doing if you complete the hands-on practical — quite complicated to do by hand, but much simpler with Parallelware Trainer. Again the same structure: the loop, the parallel region, the work sharing with the schedule clause; the uses of the global variable y replaced by uses of the private copy; and a preamble that allocates the private copy — but now it is not just declaring a scalar.

You need to allocate the memory and you need to initialize each of the elements of the array, so the tool generates this code for you. The preamble is about creating private copies with all the elements that the original variable has: if it is a scalar, it is trivial; if it is an array, it has as many elements as the original. Okay, so from here on each thread works on its private copy without interacting with any other thread.

And finally, we use omp critical, which is another way in OpenMP to guarantee atomicity, mutual exclusion. What this means is that when one thread enters this critical section, only that thread is updating the original shared variable with the values it computed locally; while it does so, the rest of the threads are waiting. When it finishes, another thread is granted access, enters and computes this part while the remaining threads wait, and so on, until all the threads have been granted access, sequentially, to compute this part.
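A hedged sketch of explicit privatization for an array (sparse) reduction, following that preamble / main body / postamble structure (all names and sizes are illustrative): each thread allocates and zero-initializes a private copy of the whole output array, accumulates into it freely, and then merges it into the shared array inside a critical section, one thread at a time.

    #include <stdlib.h>

    /* Array privatization: private copy per thread, merged under omp critical. */
    void sparse_reduction_privatized(int n, int m, const int *neighbor,
                                     const double *contrib, double *force) {
        #pragma omp parallel default(none) shared(n, m, neighbor, contrib, force)
        {
            /* preamble: allocate and zero-initialize the private copy (m elements) */
            double *force_private = calloc((size_t)m, sizeof(double));
            #pragma omp for schedule(auto)
            for (int i = 0; i < n; i++) {
                force_private[neighbor[i]] += contrib[i];   /* no synchronization */
            }
            /* postamble: merge the private copy into the shared array */
            #pragma omp critical
            for (int j = 0; j < m; j++) {
                force[j] += force_private[j];
            }
            free(force_private);
        }
    }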
So in the end what you have is the same result as in the scalar case: the original variable holding the global result of the reduction. Okay. This is what you see, for instance, in real applications of the CORAL benchmarks; there they make further refinements, for example by reducing the amount of memory used here, but essentially the concepts, the best-practice recommendations, are what you can find in our tool today. This comes from the work that we published on this sparse reduction.

Okay, and remember that for a sparse reduction you usually don't have built-in support in the standards. There are some differences between Fortran and C, but in general you should consider that there is no support for making reductions on arrays in OpenMP and OpenACC, apart from some exceptions that do exist in the standards. Okay, so that is the reason why, for arrays, for a sparse reduction, you need to use atomic protection, both on the multi-threaded CPU and on the GPU.
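A sketch of that atomic strategy for a sparse reduction on the multi-threaded CPU (illustrative names again; the same idea applies when offloading with target directives): every indirect update is protected, so correctness holds even when neighbor[] maps two iterations to the same element.

    /* Atomic strategy for a sparse reduction: safe even with colliding indices. */
    void sparse_reduction_atomic(int n, const int *neighbor,
                                 const double *contrib, double *force) {
        #pragma omp parallel default(none) shared(n, neighbor, contrib, force)
        {
            #pragma omp for schedule(auto)
            for (int i = 0; i < n; i++) {
                #pragma omp atomic update
                force[neighbor[i]] += contrib[i];
            }
        }
    }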
On the multi-threaded CPU you can also use the explicit privatization strategy, which cannot be applied to the GPU because, as we said, it would allocate private copies of the whole array for thousands of threads, exploding the memory usage; you would easily run out of memory and the application might crash. Okay, that is why we don't support it and don't recommend using this strategy for the GPU.

About each of these strategies in general: remember that the only one that has no synchronization overhead is the parallel loop. If it is applicable, it is great, because you don't need any synchronization at all — the analysis in terms of patterns guarantees that each iteration writes to a different memory location, so there is no need to worry about potential race conditions or incorrect behavior.

The built-in reduction is great if you have support for it in OpenMP or OpenACC. It is similar to MPI: in MPI you have MPI_Reduce, which is an implementation of a reduction operation across the MPI ranks. Reduction operations are so common that all the parallel programming tools have some built-in support for them. The question is whether the reduction operation you need is supported by the tool you are using; if it is supported, great.
You just use the built-in support and everything will work just fine. If it is not, then you need to use alternative implementations. Now, the most recent versions of OpenMP let you provide user-defined reduction operations, but this came in OpenMP 5, I think, and here we are considering up to OpenMP 4.5; up to 4.5 you didn't have that feature, although it is something that is coming in the upcoming releases of compilers.

But anyway, if you decide not to use it, or you don't have those features available, you still have two other strategies that you can use. The atomic one is very easy to understand: you don't need to change the code; you just execute the code fully in parallel and, for those operations that are reduction operations, you add synchronization to guarantee atomic protection.

Explicit privatization has drawbacks in terms of memory — you are using more memory, potentially much more for arrays — but it allows you to remove synchronization and reduce the synchronization overhead to a number of atomic operations that is proportional to the number of threads, not to the problem size. So you can scale to very large problem sizes for your science, and your parallel implementation will still scale in performance.
As far as we understand, the built-in support handled by the compiler behaves more or less like this explicit privatization, because when you measure performance they are more or less similar. We do expect the compilers to be able to produce more optimized implementations of the final reduction operation than what we are generating right now in Parallelware Trainer: there are ways to do the reduction using trees, using schemes that, instead of doing the final reduction sequentially, one thread after the other, do it in parallel in several stages.

So compilers are supposed to apply such optimizations for the target platform that you have. Of course, you can also write an optimized implementation of explicit privatization yourself, by optimizing the amount of memory that you allocate — instead of allocating a full copy of the array you can allocate less memory, as long as it covers the elements actually referenced in your code — and you can reduce the synchronization overhead by implementing some kind of tree reduction in the final part. But that would make learning these concepts a bit more complicated, so we decided not to implement it in the version we have available so far.

Okay, so this is more or less everything you have. In the practicals you will be playing with examples of all three patterns, so when you modify a loop you can generate a different version of your code, and for each loop you can generate all of these versions. In the practical example you have 12 loops, and for each loop you can apply two or three strategies.
So you could generate up to 40 different parallel versions of your code just by combining different strategies applied to different loops across the whole code. Doing that by hand is very, very time consuming, so the Trainer helps a lot, both in learning and in producing code: it handles this combinatorial work and supports the process of implementing all of these variants of the parallel code. Okay, one thing we will not explore here, but which you do have in the documentation:

it is something we have already added in the Parallelware Trainer version that you have installed on Cori. Very briefly, we have added support for tasking. Tasking is another paradigm that is attracting a lot of interest in some scientific domains, so it is another possibility that you have there: again more options, for the different strategies and the different patterns, to generate more versions of the code. This is more or less what it generates right now: the same pi loop can be implemented in parallel, but instead of using for to do the work sharing, the work sharing is done by creating tasks that are finally synchronized with a taskwait. This is the tasking support introduced back in OpenMP 3.0, and you have it available in the Trainer; we have also added support for the taskloop pragma from OpenMP 4.5, so with that syntax you also get a tasking implementation of the same code. Okay, this is just for your reference.
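For completeness, here is a hedged sketch of a tasking version of the same pi loop using the OpenMP 4.5 taskloop construct (one possible form; the Trainer's generated code may differ — for instance, reduction clauses on taskloop arrived later in the standard, so this sketch falls back on atomic updates): a single thread creates the tasks, and the taskloop waits for them through its implicit taskgroup.

    /* Tasking strategy (taskloop) for the same scalar reduction. */
    double compute_pi_taskloop(long n) {
        double dx = 1.0 / (double)n;
        double sum = 0.0;
        #pragma omp parallel default(none) shared(n, dx, sum)
        #pragma omp single
        {
            #pragma omp taskloop shared(n, dx, sum)
            for (long i = 0; i < n; i++) {
                double x = (i + 0.5) * dx;
                #pragma omp atomic update
                sum += 4.0 / (1.0 + x * x);
            }
        }
        return sum * dx;
    }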
Okay, any questions? So, the concepts: the patterns; the strategies that are applicable to each pattern, which are the different ways of implementing it; and then you can implement each of these strategies using a choice of OpenMP or OpenACC, offloading or multi-threading or tasking, GPU or CPU. So you have a lot of possibilities to generate and to play with for your codes.