From YouTube: 1. Introduction to OpenMP/OpenACC and GPUs, & 2. Accelerating with OpenMP/OpenACC
Description
Parts 1 and 2 from the Parallelware Trainer Tool workshop at NERSC on June 6, 2019. Slides are available at https://www.nersc.gov/users/training/events/parallelware-tool-workshop-june-6-2019/.
We are really pleased to be here presenting what we are doing from Spain at Appentra: essentially, creating a new set of tools based on the Parallelware technology, which is unique to Appentra and is why this startup was incorporated and funded. What you will see today is how to use this tool to learn best practices for parallel programming with OpenMP and OpenACC for GPU programming, based on a different approach: understanding your code from the point of view of patterns.
We will try to cover all of this approach so that you can really understand how the tool works, and why the tool is able to do what it does right now. We will also try to set expectations on your side about what the tool can do right now, the new features we are working on during this year, and our plans for the next releases.
Okay, so, as I said, we expect you to learn a different approach to parallel programming. The usual approach is: look at your code, look at your instructions and the dependencies between them, insert directives, test whether everything works just fine, and, if the code doesn't work, find out where the problem is and fix it by modifying clauses and pragmas. This process is time-consuming; it is a real problem. It takes a lot of debugging effort, so we want to avoid that effort in that part of the development workflow by understanding from the very beginning how your code behaves, and how this behavior can be used to understand how to convert your code from sequential into a parallel version that is correct and that is performant. For that purpose, what we expect you to learn today is how to decompose real codes into these parallel patterns.
For this purpose we have prepared a simplified version of the well-known LULESH benchmark, one of the CORAL benchmarks, from the hydrodynamics scientific field. Using this version of the code, you will see that you can address complexity in the code, try to parallelize it with the trainer, and learn how to parallelize more or less real codes. Let's see how we can understand the limitations and the current benefits of the tool.
And this is not only OpenMP and OpenACC for GPU: you will learn concepts that are properties of your algorithm and of your code, independently of the hardware platform that you are using. This is the power of this pattern-based approach, and what we have learned we have put into the tool. We have collaborated with centers such as Oak Ridge National Lab, NERSC, the Barcelona Supercomputing Center, and Jülich in Germany, and we have come to an agreement with all of these people on what the best practices for parallel programming using OpenMP and OpenACC are.
In short: what is good for you when you want to implement a code in parallel; what kinds of implementations you can expect to have good performance; what kinds of implementations you cannot expect to have good performance, and why. These best practices of parallel programming are something that we will learn and discuss during the training, now and in the afternoon. And of course you can use the patterns approach both for CPU and for GPU, using the Parallelware Trainer tool.
You will see that you can set up and choose between different CPU and GPU platforms, the OpenMP and OpenACC standards, and even different programming paradigms like multithreading, offloading, and tasking. We will see that in the demonstration of the tool — and, of course, do not hesitate to interrupt me at any time; I will be really pleased to answer your questions. So, the agenda for today: before the break, I will try to introduce the minimum set of concepts that you need to understand to go from the CPU to the GPU.
We will not go into details of the hardware, and we will not go into details of the semantics of the pragmas and the clauses. You will learn that by practicing and by studying what the tool is doing: you will see the tool doing some work, you will analyze the output of the tool, and you will be able to learn all of those issues. Here we only want to introduce the key concepts that distinguish CPU programming from GPU programming, so that you have the minimum set of knowledge you need to program the GPU well. Then we will show you how to write code using OpenMP and OpenACC, and here I will do a demonstration of the tool — a walkthrough of the graphical user interface — so you can get a first feeling of what the tool looks like and how to use it. Later, after the break, we will focus on the key theoretical concepts of the patterns, again in a very lightweight way.
These are the key concepts you need to understand; with the help of the tool, you will be able to recognize these patterns and apply different policies and strategies for each pattern. Then you will be able to do a practical by yourself, repeating with the pi code what I have done in my demonstration, on your own, following the worksheet that Helen and I will share with you during the morning. You will follow a set of steps to get used to the tool, generate multiple versions of pi, measure the performance of these different versions on Cori using the CPU and using the GPU, and, with all that knowledge, compare and pick which you think is the best implementation — the one that provides the best performance. We will use the same example that has been used here.
The idea is that if you have a code that you are interested in, you can try to understand how it decomposes in terms of patterns and see if you can approach the parallelization of your code with our tool; then we will sit with you to see how you can take a first pass at using the tool with your code. For those who don't have a code, we have prepared a simple example — a simplification of the LULESH CORAL benchmark — again with a worksheet with a detailed set of steps.
With it you can cover everything you need to create a GPU-enabled OpenACC version and an OpenMP version that runs faster on the GPUs of Cori. If you have never heard about this code, and you don't know what hydrodynamics is about because it's not your field, you will not need that. The important thing you need to learn is only about the code: how you code your algorithm, what the properties of this code are, and how this puts a set of limitations on how you can parallelize your code — how the code decomposes.
Who of you here has already programmed a GPU? Has someone already used OpenMP? No OpenMP, no OpenACC, no parallelism — okay, great. Then it's a good approach to have this content here, because we begin from the very basics: we don't assume any previous knowledge of parallel programming, GPU programming, or multicore programming. So feel free to ask questions, because this course is for you; we begin from scratch. Okay, so let's go on to lecture number one: introduction to OpenMP, OpenACC, and GPUs.
Okay, GPUs are kind of a trending topic. Everyone wants to program the GPUs, because a colleague is also working on GPUs and is getting amazing speedups on their code. GPUs are also supposed to be kind of the future for getting good performance — peak performance — on pre-exascale and exascale supercomputers, the machines that are being manufactured now and that will come in the next generation. Why? Because the CPU usually takes a lot of power, while the GPU consumes less power.
On the exascale roadmaps that are building the next generation of the most powerful supercomputers in the world, mostly all of them use accelerators. I think this is from last November: 9 out of the top 10 supercomputers use accelerators, and 5 of them are using GPUs; the others are using different types of accelerator. So if you want to port your code to one of these machines to do big science, then somehow you will need to take advantage of the GPUs to get resources allocated for your code. And apart from that, GPUs are ubiquitous — even my laptop has a GPU,
so I can install the full software stack and use OpenACC or other programming languages to accelerate my code on my laptop. It's something that you need to learn, because at some point you will need to port your codes for peak performance, to do good science and big science.
So what is the GPU? The GPU is hardware specialized and designed to do massive floating-point operations. While you usually see the host, or CPU, as a set of floating-point units that is more or less limited in number, the GPU can be seen as thousands of floating-point units, independent of each other, that you can use at the same time.
Doing many things at the same time in different pieces of the hardware provides a lot of computational power that you can use to accelerate your codes. Okay, so what you can imagine is that you need to somehow specify how your code can be executed as vector instructions. Just remember this idea: if you execute one instruction on a vector unit, you are using only one of the lanes of the vector unit.
So with four lanes and one instruction, you are not using three of the four floating-point units. Now imagine that you have thousands of them: if you run sequentially, you cannot expect to get good peak performance out of hardware designed like the GPU. So don't run sequential code on the GPU, because it will typically be much slower than running simple multithreaded code on the CPU. And apart from that, we will not go into those details.
I will only introduce some of the concepts you need to understand why the GPU has a complex memory design. The reason why it is so powerful is that all of these thousands of floating-point units can access memory units that are dedicated to groups of these floating-point units — but this imposes limitations on the communication between the threads, so not all threads can communicate with all the remaining threads. This is a characteristic that differs from the CPU.
Whenever you create an OpenMP multithreaded program, all the threads you create can communicate with the rest of the threads. This doesn't happen on the GPU; it is an implication of the complex memory design and the complex hierarchy that you have on the GPU. So just remember these things, and you will see how we introduce a few concepts so that you can use the GPU and its memory design to accelerate your code.
Despite the complexity of the hardware, we will not need to go deep into the hardware details to accelerate codes using OpenMP and OpenACC. Okay, so, in contrast to the GPU model, in CPU multithreading we usually have one host and one memory. You start your sequential code there; it uses the memory to do the computations and provide you the result. If you enable the code with multithreading using OpenMP, you have multiple threads running at the same time, all accessing the same memory, to provide you the result.
When you want to use the GPU, you need to start your application — your code — on the CPU, either single-threaded or multithreaded, but it needs to start on the CPU. At some point you specify a region of the code — in CUDA terminology, for instance, this is called a kernel — that you offload to the GPU. Offloading means that a separate binary is created for the hardware of the GPU, and that binary is transferred to the device.
You have a host and a device; each one has its own memory, and you need to transfer information — code and data — from the CPU to the GPU (host to device) and back from the GPU to the CPU (device to host). This is essentially what you will specify when you add OpenMP and OpenACC capabilities: you will say, "I want this piece of code to be offloaded to the device," and you will need to specify what data needs to be transferred.
So we have just said more or less all of these things: the GPU execution model is a host-driven execution model. Remember that your code will start on the CPU; only the part that you specify to be offloaded to the GPU will be executed on the GPU, and afterwards the result will be transferred back to the CPU. It is the CPU code that will provide you with the results of your science.
Okay, sequential code runs on a conventional processor — on the CPU of your machine — so the computationally intensive parts of your code need to be transferred to and accelerated on the GPU. To maximize performance on the GPU, what you need is to identify those parts of your code that consume most of the execution time. This is what is typically known as hotspots, found when you do a profiling of your application. How many of you have done a profiling of an application?
If you start from scratch with a code that you don't know, it is mandatory that you start with a profiling, so that in the first steps you focus only on those parts that consume most of the time. One hour or one day of development time invested there will provide you a bigger return on investment than focusing on a part of the code that is only 5% of the execution time. That's why combining this with profiling is important.
So once you have these parts of the code identified, what you need to keep in mind is more or less these three guidelines. First, transfer the data onto the device and keep it there. What that means is that transferring data from the memory of the CPU to the memory of the GPU is the most costly part of your GPU-accelerated code; you need to minimize it. If you can transfer data, leave it there and use it; don't transfer it back.
Okay, so remember: minimize data transfers, and leave data on the device if possible. Second, give the device enough work to do. Remember that you have thousands of floating-point units; if you use only 10% of them, you are using the GPU, but you are not using 90% of the floating-point units that are available for your code. What this means in general is that you will need to run big problem sizes to take advantage of the computational power of the GPU.
So let's go on and see why you would use OpenMP or OpenACC, in contrast to many other programming tools that you can find in the ecosystem. First of all, GPUs have a reputation for being very difficult to program — and they are difficult to program, indeed. If you want to program the GPU to achieve peak performance using CUDA or OpenCL, you really have to rewrite your code completely, and probably rewrite your data structures.
OpenMP and OpenACC are here to help us bridge the gap. Essentially, what they provide is a set of directives — a simple application programming interface — that we can use to GPU-enable parts of our code incrementally, without rewriting the whole code as we have to do with lower-level approaches like CUDA or OpenCL. OpenMP and OpenACC are designed with productivity in mind.
Creating a parallel version of your code is very time-consuming; it's complex, and you need a lot of expertise to do it. So whenever you create your parallel version, the question that you have is: okay, I created it for system one, but I need to do bigger science and run it on system two. Can I port the code to system two so that it runs and provides the correct results? So OpenACC and OpenMP have also been designed with portability in mind.
What that means is that, as you do with your sequential code, you just recompile your code with the appropriate flags on a different system, and the code should run. They also offer good readability. Remember, if you have used MPI — have all of you used MPI? — starting from sequential code, if you want to create an MPI version, you have to rewrite your code completely.
Maybe you can recognize some of the loops, but you have to add MPI_Init and MPI_Finalize, all the data transfers, all the communications, so the parallel code and the sequential code hardly resemble one another. OpenMP and OpenACC are designed to avoid that: you keep your sequential code and you add OpenMP and OpenACC capabilities through the pragmas, but you still have one piece of code that you need to maintain and improve — not two or three separate codes, each tailored or specifically designed for one parallel programming model.
Another good thing about OpenMP and OpenACC is that they abstract away many details of the hardware. If you code in MPI, for instance, you need to program every single communication between every single pair (or set) of processes. On the GPU, if you write code at a low level using a library like CUDA or OpenCL, you need to control how the threads are created and scheduled, how they communicate, and what parts of the memory they access.
What different levels of the memory hierarchy are they using — the global memory, the shared memory, the scratchpad, the cache? You have very complex hardware on the GPU, so you need to be aware of that when you program at the low level. The good thing about OpenMP and OpenACC, as we will see today, is that many of those details are abstracted away for you. You don't need to care about them — just notice that some of them exist, and that OpenMP and OpenACC provide you some ways to control how to use some of these hardware features.
Okay, so an implication of all of this is that it minimizes the need for code refactoring of sequential code. When you wrote your first MPI version, you probably rewrote most of your application. OpenMP and OpenACC are designed to avoid that: you just add some pragmas and enable some flags in the compiler, which converts those pragmas into parallel code. And if you don't want to use them, you disable the flag in the compiler and you get the original sequential code — no need to maintain different versions of the code.
OpenMP and OpenACC support C, C++, and Fortran. Now, in order to correctly set the expectations that you may have of the Parallelware Trainer tool: at this moment we are supporting the C programming language, because of some technical implications, but we are working — and expect during this year — to have support for C++, especially for C-like code within C++ files, and we are also working on Fortran.
We have first results that we will present at ISC in Frankfurt, in Germany, in two weeks, and we hope to have some of this Fortran support by the end of the year, by Supercomputing. But this is something where I want to set correct expectations: we are working on it very hard, but let's see how we can do it. So all the examples that you will use today are written in the C programming language. Are all of you familiar with the C programming language, more or less? Yeah.
Yes — all the method about the decomposition of the code into patterns applies to any code in any programming language; it is independent of the programming language that you are using. What is tied to C is the current version of Parallelware Trainer, 1.0, that we have installed on Cori and will be using today. But we will improve the product so that we can support C++ and Fortran in the tool.
We can evaluate that: as long as you write some of your functions in a C-like style within the C++ code, you can analyze those files with Parallelware Trainer. We have done it in the past, so we can do it, but it is very dependent on the features of the C++ programming language that you are using in your code; we would need to evaluate that.
Okay, so, finally: what are OpenMP and OpenACC? They are one more method to use the GPU, and they are designed as extensions to the programming language — that is, the pragmas and directives extend C, C++, or Fortran. So if you have your code, you add the syntax of the directives and the pragmas to add OpenMP and OpenACC capabilities to your code. It's an extension to the language; it is not part of the language itself.
They use compiler directives. What that means is that you have the support of a compiler: whenever you specify the correct pragmas, directives, and clauses, it is the compiler that does the hard work for you. If you code in MPI, you need to decide when the ranks are spawned, when they communicate, and when they finish; it is you who has to decide and make that implementation.
In OpenMP and OpenACC, we specify where a parallel region begins and where a parallel region ends, and with the support of the compiler it will generate the binary code to create the threads — using POSIX threads, for example — and to destroy the threads at the end of the parallel region. You don't need to worry about all the complexity of using the underlying threading library available in the operating system. Okay, so that is what compiler directives, and having the support of a compiler, mean. In OpenMP and OpenACC, both of them use a host-driven programming model.
Remember that it is the CPU that starts the execution and controls the execution; we are only offloading the most computationally intensive parts to the GPU, and the CPU is waiting for the results coming from the GPU. It is a host-driven execution model. And both of them use the concept of a thread or a task, so, simplifying a lot,
you can consider the abstract concept of a task, with several implementations — threads, processes — where tasks collaborate to solve one single problem in parallel, to finish earlier, faster, and to provide you the same numerical result. And again, they are focused on portability of your code.
You want your code to be executed on Cori, but you also want your code to be executed on the next machine that will come to NERSC, and you also want your code to be executed on your laptop or on another supercomputer that you need to use for the purposes of your science. That's what portability means. Okay, so, to finish this part: benefits and limitations of OpenMP and OpenACC. Benefits: OpenMP and OpenACC are simple to use, as you will see, and they are portable across different systems.
You recompile your code as you do with your sequential code, and they are hardware-independent: when you have an OpenMP code, it will run on any multithreaded operating system, as long as you have a compiler supporting the corresponding OpenMP standard — and the same holds for OpenACC. Limitations: as we said before, we have the advantage of making parallel programming more productive — faster, a better use of our time — but this comes at a cost. The cost is that you cannot control everything in your program.
You can only control those features that are exposed in the application programming interface of OpenMP and OpenACC. If you want to do something different, then you need to go and use different tools like CUDA or OpenCL, which are designed to allow you to control everything that you can do on the GPU — but for that reason they are much more complicated to use.
So once you learn the basic knowledge — once you know how to use the standards and how to program for the CPU and GPU using OpenMP and OpenACC — you can begin to think about how to make your parallel implementation better, so that you can incrementally increase the performance of your application, and at some point your GPU code will be faster than the CPU code. Okay, so, in order to optimize performance, remember: on the GPU you need to reduce data transfers. That's the number one priority — avoid data transfers.
Whenever you can, allocate memory on the GPU, use it there for the computation, and avoid data transfers back and forth between the two memory systems of the host and the device. And again, about peak performance: you will see papers or articles or announcements about the computational power of the GPU — "we made the application run 200 times faster," "we made our code 70 times faster than before." How can you achieve that?
You can achieve that peak performance usually by doing very sophisticated programming of the GPU. For an average application, a realistic performance of three, five, or ten times faster is something that you can consider good performance on the GPU, without going into the burden of all the details of the low-level programming interfaces of CUDA or OpenCL. It depends on the application: depending on its characteristics, on the patterns of your application, you can even obtain higher speedups, but it pretty much depends on your application.
Before doing the demonstration of the tool, let's do a very fast review of the pipeline — the steps you have to follow in order to parallelize your code in general, and in particular to parallelize your code to execute it on a GPU. You begin — remember — by profiling your code. If you have never done it, it could be good, as you follow one of these courses, to do a simple profiling, to double-check and be sure that the functions you are working on are those that really consume most of the execution time.
That will give you the biggest return on investment for the effort of going to the GPU. So: first, identify the hotspots. Second — probably the most difficult part of parallelizing for any platform — analyze your code to discover parallelism. You need to understand your code, and, as we said, here is where the patterns approach that we will be using provides a lot of value; it is completely different from other approaches that you can see in similar courses or tutorials.
In "analyze for parallelism" is where you will see the value of understanding your code in terms of code patterns. Next, once you know the hotspots — the loops — and understand them in terms of parallelism, and you say, "okay, this loop can be parallelized," then you need to decide how to implement that parallelism. That's what we mean here by adding directives using OpenMP or OpenACC: these are implementations of the parallelism
that you have discovered in the second step. So: implementation of parallelism with directives. In this third step, Parallelware Trainer will help you to produce many implementations, using OpenMP and OpenACC, of your single code; it will help mainly in these two stages. When you produce a parallel code, you then need to compile it, run it, and measure performance. Did the performance increase? Yes — is it enough for me, for my problem? Then stop, and go do something different.
If not, then you need to optimize your code, which typically means improving data locality and minimizing data transfers on the GPU, and then start profiling again, to see if the hotspot that you found before keeps on being the most computationally intensive part of your code. This is an iterative process that you need to repeat.
When you go through all of this, essentially what you will find is that you have your code, you have your hotspot, you have identified parallelism, and you have implemented a parallel version that runs faster than the original code, so you will be accelerating this part of the code. But in order to get peak performance — this 100x speedup acceleration — you also need to parallelize the remaining sequential regions.
That will be covered in the next talk, which relates these peak speedups to the speedups achievable in real applications — essentially the effect of parallelizing loops. So let's go to the demonstration.
In the demonstration you will see OpenMP and OpenACC pragmas. What you will see is C code with some extensions, and these extensions have the form of a preprocessor pragma. After the special symbol `#pragma` comes what is called a sentinel; the sentinel identifies the family of pragmas that you are using. OpenMP uses the sentinel `omp`; OpenACC uses the sentinel `acc`.
After the sentinel, you have the name of the directive. We will use `parallel`, we will use `for`, we will use `critical`, we will use `atomic`, we will use `data` — different pragmas that by default have a meaning, a behavior, that is specified in the standard, but that you can modify using several clauses that change the default behavior of each of the directives. This is essentially what we will see.
In C and C++, by default it is the syntax of a pragma; in Fortran, it is the syntax of a special comment, with a dollar symbol before the sentinel (`!$omp`, `!$acc`). But the rest is essentially the same: a directive with clauses to modify the default behavior.
Finally, before getting started with the demonstration: OpenMP and OpenACC compilers. We have several of them. Probably the most mature OpenACC compiler on the market is PGI; we have a recent version on Cori, 19.4 I think. We also have a Cray machine with the Cray compiler, which also supports OpenMP and OpenACC. And GCC and Clang, the free, open-source compilers, also have support for OpenMP — very mature support — and they are pushing forward support for OpenACC, so in the most recent versions of the GCC compiler you can also compile OpenACC pragmas as well as OpenMP pragmas.
[Adjusting the projector — it is losing some part of the screen.] Okay, so this is what you will see when you open the tool. On the left-hand side you will see a project manager, as you have in many source code editors, so that you can manage different projects at the same time. Here you have the option to select a project — the one that you will be using in the afternoon — or you can, by clicking on File, Open Project, open a new project.
For instance, I have several projects here for the demonstration; let's open the PI example — click on it, and then you have the PI example here. One thing that is important to note is that a project is essentially a directory in your file system, nothing else than that. Well, we also store some hidden information, which you will see during today, so that you can recover your work and take it away with you, with all the work that you have done during the practicals.
This is important because many times you will have real codes which have a build system, a compilation system, scripts to run the code — and we don't want to interfere with that part. You just open the directory. If you are using a version control system such as Git, a directory is what is under version control; for us, a directory and all its contents is what a project is for Parallelware Trainer.
Whenever you open a project, the tool scans the directory and provides you with the contents of the project. In this case, if you double-click on the example called PI, you will see the code that we are using, and you will see these special green circles. What this means is that, in real time, the Parallelware technology has analyzed your code and has found the loops. It has checked that some of the loops cannot be analyzed for some reason — it can report on that — but it will show a green circle
A
If your loop is a candidate for parallelization — the tool has checked that it fulfills a minimum set of properties — the green circle tells you that this is a loop where you can begin to reason in terms of parallelism and introduce parallelism. When you click on one of these green circles,
A
you open this dialogue. In this dialogue we will be using, in the morning, these three panels here. As you can see, you can choose between OpenMP or OpenACC, CPU or GPU, multithreading or offloading. Let's begin with a simple example: OpenMP CPU multithreading. What I want to generate is a multithreaded version of the pi code to run on a CPU that has multiple cores. So once you select that, you click on this Parallelize button here, and here it is: the tool has analyzed
A
the code, has discovered the parallelism and has added the pragmas for you. These pragmas are correct and accelerate your code. How has the tool done this? Let me scroll up. Remember that we have an approach based on patterns: the tool discovers the type of pattern that you can find here. In the lower part of the UI you have three consoles: one for building, compiling your code; one for the execution — once it is compiled and you run it, the output goes to that console;
A
and finally the Parallelware console, where Parallelware reports the messages of the analysis that has been done. So here it is saying that at line 27 — the original line 27 — it found a scalar reduction pattern, where you have a variable that is processed using a commutative and associative operator. This is everything you need to know to determine that this loop can be executed safely in parallel.
A
OK, so in the first line in the Parallelware console, the tool provides you with the pattern that it has been able to discover. After the break we will see the full family, the set of patterns that we have available in the tool, and you will learn to recognize them. After that, you can see the available policies and strategies for the variable sum. What this means is that, once the pattern in the code has been identified,
A
the tool supports different ways of implementing parallel versions, using OpenMP or OpenACC and different programming paradigms. So you will be able to select which of these implementations you want to generate with the tool, and this is done automatically by the tool — you just have to give the appropriate instructions. Here, by default, it has selected strategy number one, scalar reduction; we will see that after the break.
A
What this means, essentially, is that the tool has taken the loop that you want to execute in parallel and enclosed it in a pragma that defines the parallel region. The pragma omp parallel says: here begins the parallel region. Until this moment you only have one thread; at this point, different threads are created, and all the threads collaborate until the end of the parallel region, where all of them are destroyed and only one continues.
A
OK, this is what parallel means. What it also means is that, in order to parallelize a loop, you need to divide the workload, the number of iterations, between the different threads. If you have ten iterations and two threads and each thread executes the whole set of ten iterations, you are really running a different program, with twenty iterations. That is not what you want, so you need to divide the ten iterations among the set of threads that you have created. How do you do that in OpenMP? With another pragma, omp for. This is
A
what is called work sharing: how to divide the iterations of the loop among the threads. And finally, you can see the reduction clause: it is saying that the variable sum is implemented using the scalar reduction policy. We will see in detail in the afternoon, after the break, how all of these parallelization strategies behave, but what I want you to see right now is that you will be able to choose between different policies and strategies. Finally, the console provides you some information about the generation of the code, how the code has been implemented.
A
If you don't know exactly what a scalar reduction means, the trainer also comes with a knowledge base that will keep improving and growing with the different versions that we release. So if you look at this message, you'll see this underlined text, and if you click on it, you will be presented with a glossary of terms that explains what a scalar reduction is. If you want to learn more, you can click on some of the glossary terms or on Learn
A
More, which will provide you with a more complete description, with examples in C and Fortran of what a scalar reduction looks like in those languages. So some part of the important knowledge about the patterns that you need to learn for parallel programming is also available within the tool; you don't need to go anywhere else to find it.
A
So let's close that. OK, we have generated the version; let's compile it. How do we compile it? If you look at these buttons here, this is the Settings button, where you can specify the command that you will use to build your code. Let's use, for instance, GCC, activating the OpenMP support: something like gcc -fopenmp pi.c -lm. What this command means is that you will be using GCC. You could use PGI or Clang — you are not tied to any compiler; it's just whatever set of compilers you have available on the system.
A
Cori has many compilers available. -fopenmp is the flag that you use to activate support for OpenMP pragmas. It means that the compiler, GCC, will take these pragmas and generate parallel code to implement the semantics of the parallel program, of all the pragmas that you have specified. If you don't enable this flag, these pragmas will be ignored and you will have the sequential code, as simple as that. Then, these are regular options: the name of the file and the name of the executable.
A
Yeah, I can do it later. I have a makefile there in the project, but I prefer to use this to introduce the command and the flags, because for many of you it is the first time that you are using OpenMP. But there is no restriction here: you put the command that you would execute in this path in the terminal. If you go to the terminal and execute the same command, it is exactly what the tool is doing behind the scenes. OK.
A
So if we specify this command and now click on this hammer button, Build Project — here it is, the code has been compiled successfully and now we have the executable generated. Now we want to run it. How do we run it? We go again to the Settings button, we select the Run tab, and we put the run command there.
A
OK, so I click on OK; now I click on this play button to start the execution, and in a different console it shows the standard output of the execution of the command. Everything you see here is the standard output that you would see from a terminal execution. OK, so this has run sequentially. So how can we run this in parallel, using several threads?
A
OMP_NUM_THREADS equals one. What does this mean? OpenMP and OpenACC provide you with pragmas (directives), with functions and with environment variables. With the environment variables you can control several things; one of the things you can control is the number of threads that you will be using in the parallel region. So if we first specify one, it will mean that I have a parallel region with only one thread — so sequential execution. I can run it like this.
A
Hopefully this replies to your question. Yes, those are the two mechanisms that we provide to control environment variables: you can add them here, or you can prepend them to the execution command. Of course, if you invoke a makefile, the makefile itself can set up all the environment variables it needs to use.
A
No, we are not providing a command-line interface at this moment. We have a different tool that we are designing, called Parallelware Analyzer, that will provide you a command-line interface to this kind of capabilities. It is designed and intended for batch processing, for compilation outside of the UI. It's something that is work in progress.
A
OK, in this current version, Parallelware Trainer 1.2, we can discover opportunities for parallelization in one file at a time. What that means is that, if you open a file and all the functions that you use in the for loop are defined within the same file, we can analyze it and we can discover parallelism, no problem with that. What is different is if you call a function that is defined in another, second file — second_file.c.
A
That is something that is work in progress, that we are about to finish, and it's a feature that is expected to come in Parallelware Trainer 1.3. That is a feature that is needed for big codes, because you usually call functions in one file that are defined in different files, so you need to somehow analyze several files altogether. That's something that is work in progress; we are about to finish it, and it will probably come in Parallelware Trainer version 1.3. At this moment, it's one file at a time; we are working on that. Yes.
A
From the point of view of the analysis of parallelism, main is just another function. What we do is analyze the code and try to find the functions that are called from a given function, to try to discover opportunities for parallelization. So, from that point of view, main is just another one. Main is important when you build the code into an executable, but not for us to do the analysis.
A
OK, one more thing I want to show you. Imagine that you are working on your project and you have made this change, and now you don't want this implementation. What do you do? You need to recover the original version somehow. In order to facilitate that workflow, we have this little arrow here. You cannot see it, but it is there whether you notice it or not; when you click on it, it opens this panel over here.
A
What this means is that, whenever you click on the green circle, choose the options, click on Parallelize, and the code changes, this is changing your actual pi.c file. But before doing that, we save a backup copy of the code that you had before inserting the pragmas. So this appears here: this is a kind of built-in versioning system, so that you can maintain the different parallel versions that are of interest for you and for your project. So, for instance, I can take this one — you cannot see it there.
A
So if I click, it will ask me: are you sure you want to do this? Because the tool is going to replace the contents of pi.c with the original version. So you need to know what the versions that you have here are — somehow save versions for your milestones as you make progress, so that if you do something that you don't want, you can restore one of these versions to begin again from a checkpoint.
A
OK — what did I just do? Let me cancel; we were here. OK, so let's click on the versions, and now you have different files here. Every single parallelization change that you generate creates a backup copy. So now, if you want to restore one of these original copies to discard those changes, you can just click on this button here: restore this version to the editor.
A
You have to confirm that the file will be overwritten with those contents. I say OK, and now I have again the same version that I had. These versions you can delete: you click here and you confirm the deletion of the versions. Let me, just for the sake of clarity, delete all these versions from the tests we did yesterday.
A
OK — even the original one I can delete. So the suggested workflow for the practicals is as follows: click on the green circle, generate the OpenMP CPU multithreaded version, Parallelize. It will generate the original-version backup. So now you can click on it; if you want to rename it, that's up to you. But what you can do, by clicking on this button,
A
is save a version explicitly, not automatically. So you say: OK, I want to create a new version that is pi using OpenMP, using the reduction clause, and now I have my version here. At this moment, what I can do is restore the original version, click on the buttons again and say: OK, I want an OpenACC GPU offloading version. I click on Parallelize, and here is your equivalent OpenACC implementation. So you can see some similarities between OpenMP and OpenACC up to this level.
A
Parallel/parallel: the same semantics — where the region begins and ends. For/loop: more or less the same semantics for sharing the iterations among the threads. Reduction/reduction: the scalar reduction pattern that has been found needs to be computed as a reduction, in OpenMP and in OpenACC. And then you have some additional clauses that we will see later. OK, but you can compare them, you can learn from that, and with this you can generate all the versions that you want.
A
For instance, I can also generate a version using the ACC reduction strategy. One more thing we can show at this moment. OK, you have been doing your practicals, you have been doing all of this work, and the question is: you are working here, you have access to Cori and to Parallelware Trainer; the workshop finishes and you go back to your office. How do you take away all of your work? Do you have to lose all of this? Do you have to copy and paste all of this? You don't have to. So let's see how things are stored in the file system.
A
There is a hidden directory named .pwt, and under it the tool keeps track of all the copies, all the named versions, of each file that you open and work with. This is the sequential code; this is code you can compile and run. So if you compress this whole directory, with this hidden folder, you have everything that you have done during the practical. You just have to compress it to take it away with you. This is exactly how version control systems work.
A
They create these hidden .git directories, .svn directories, so we do it in the same way — first so as not to interfere with other tools, but also to make it very portable. You can just compress it and take it away to another file system; if you have the tool there, you open it and it will recognize all this hidden information. OK, what you lose, if you don't have Parallelware Trainer, is all the usability features that you have in the graphical user interface, but you still have all the versions.
A
From the graphical user interface we don't have that capability. That's something that we discussed internally, because these are the kind of features that you have in professional IDEs, professional code editors: you can manage all the folders that you have in the file system from the graphical user interface of the code editor. But the problem here is that every single user that we talk to usually uses a different editor; they have their preferred editor, and they don't want to change their editor for development.
A
So we have tried to minimize the number of features that we have to manage projects in the trainer, because one user prefers Eclipse, another one Qt Creator, another one vi — every single developer has a preference for a different development environment. So here we decided to minimize the number of project-management features, because the user will be doing that in their preferred professional environment. That's the feedback we have got.
A
Creating a new project is just creating a new directory in the file system — again, from outside, not within the GUI — for the same reason that I tried to explain before. In professional development environments you have all the capabilities in the project manager to manage the file system: create directories, move files around, delete them. But we decided not to implement that in this case, because the graphical user interface is not at all intended to replace your preferred development environment.
A
What did you mean by remote? That you open the trainer locally, work locally, and at some point you want to do a remote launch on Cori? In order to do that, you just have to set up the appropriate execution command to transfer what you need through SSH — to copy what you need to Cori and launch the process. From there it's up to you how to do it, because that is independent of the organization of your project. Sorry.
A
At this moment — this is something we discussed yesterday — best practices for parallel programming recommend that, instead of using parallel for within target, you use teams distribute parallel for, to give the compiler the freedom to generate better-quality code for the GPU. This is something that will come in 1.3, probably, but it is something that we will definitely do. At this moment you can do it by editing here: this is a complete editor, so you can come to the editor and type teams.
A
We will see after the break why it's important to specify teams distribute parallel for on the GPU, and how to do that in OpenACC, which has the equivalent gang, worker and vector notation and terminology. We will see that later. OK, so yes, this is a complete editor, as you have in any other tool. You can modify and save versions and keep on working with them, and all the versions you save will be stored in the file system, so you can take away everything that you have done.
A
The last demo: what happens if I want to parallelize this loop, in OpenMP or in OpenACC, using a different strategy? Let's say I want to use atomic protection. I click on Parallelize, and now compare these two versions. They are both correct implementations of exactly the same original sequential code. The difference is how we implement the scalar reduction operation; we have several strategies.
A
The default strategy, whenever it is available in the standard, uses the reduction clause. But there may be situations where you have a reduction operation that is not supported by the standard, or where, instead of a scalar variable, you need to use arrays — and arrays are not supported for reduction operations in OpenMP and OpenACC in general, with some exceptions. So in that case you can still generate parallel code; the rest of the implementation is the same.
A
It's only changing the way this reduction on sum is handled: in one case through the reduction clause, and in the other case by guaranteeing mutual exclusion when each thread is computing and accessing the shared variable. OK, so what I want to show you here is this part of the panel.
A
I will restore again. Using this part of the dialogue, you can control which of these implementation strategies you want to generate, and the trainer will generate the different parallel versions that you can later compile, run, measure the performance of, and select the one that is fastest on your system. OK, so we will explore this in detail in the practicals after lunch; we will go into the details of all of this. OK, so now, yes, I think it's time to stop and take a coffee break. Thank you so much for being so interactive.