From YouTube: Intro to GPU: 05 Programming for GPUs with Directives
Okay, so I'm going to talk about directive-based programming. You've heard Jack, and you've heard Max, talk about the other approaches; this is in the middle: it's not as easy as using a library, but it's also not as hard as using lower-level programming.
So first I want to thank lots of people whose materials I used. In particular, I've got permission to use the NVIDIA OpenMP training materials, so there are some slides and, especially, the hands-on code from them: the Laplace equation.
It was a Jacobi solver; I've mostly followed their OpenACC code and converted it to OpenMP, just so we can see some details. There is also Tim Mattson and Simon McIntosh-Smith's "Programming Your GPU with OpenMP" tutorial at SC19, and (this is probably not the complete list) other talks such as "What's New in OpenMP 5" from my colleague. And then Chris Daley here: his slides have lots of performance data, as does the NVIDIA bootcamp.
There is also, most recently, the ECP OpenMP BOF as well. What I'm going to do today is not a separate OpenACC talk followed by an OpenMP talk; basically, I will try to mix them together, because they're so similar: lots of concepts are equivalent. That's what I'm trying to do here today. So first, CPUs versus GPUs.
Okay! So here are one sample OpenMP code and one simple OpenACC code. These are directives (I'm going to explain what a directive is later), but they are basically three lines added to each of the original source codes, and the compiler may ignore them if it doesn't recognize them, or if OpenMP or OpenACC support is not enabled. So these are called directives: a #pragma for C/C++, and a !$ sentinel comment for Fortran. The advantages of directive-based programming: first, you can do incremental programming.
You find your hotspot, you add some directives, you check progress and check correctness, and then you repeat. It also allows you to maintain a single source for sequential and parallel programming: you use a compiler flag to enable or disable it, and there's no major rewrite of your sequential code. It works for both CPU and GPU, and it has a very low learning curve, because you stay in your familiar programming environment of C, C++, or Fortran. You do not have to worry about lower-level hardware details; the compilers will hide them for you. And it's portable.
So let's talk about what we call the device execution model. Basically, it's host-centric: the host is here, and the device does the work. When you do the offload, you create an environment on the device, and then it starts to map data out there and offload work there; when it's done, it gets the data back to the host, and then it can destroy the environment on the device. The CPU is usually the host, and then you can have one or multiple devices.
Parallelism-wise, just remember: there's no synchronization among the teams; within each team, you can have synchronization. So what we do for the offload is the pragma omp target, and then teams distribute parallel for simd. We recommend writing the whole combined construct instead of separate directives; the compilers will work it out. The reason is that, with the three levels, each compiler chooses different levels to parallelize, so this way you cover everything. Here's a little diagram of what it means.
So here we see target (there's a data construct outside; we'll talk about data later): basically omp target, and then teams. You could set the number of teams here, but without distribute, everybody would do all the same work. You have a league of teams, and at this point every team runs the same thing; once you add distribute, the loop is divided among the multiple teams. And then, when you do parallel for simd, within each team you create more threads and more vectorization. This is very, very similar to the OpenACC style.
What OpenACC has is: instead of target, you have parallel, and again you offload to your device; then here you add gang. But if you stop at this point, with just gang here, you will have created more workers and gangs, but they will do much redundant work as well.
So what you do is pragma acc parallel, and then you add acc loop, or acc loop gang worker vector; or, if you have multiple loops, you put acc loop gang on the outer loop and then acc loop vector on the inner loop. For more hands-on tuning you would do that, but at the beginner level, just write acc loop and let the compiler choose it for you.
So the loop directive: it's available in OpenACC, and it's already in the OpenMP 5.0 standard, so you will see implementations coming out soon, and it'll make your life a little bit easier as well. It gives you very similar ways to do it in the OpenACC and the OpenMP environments.
Okay, now let's talk about the syntax of OpenACC and OpenMP. We talked about the pragma: #pragma acc for C/C++, and for Fortran you would have !$acc with an end directive; depending on what the directive is, not every end is required, and some of them are optional. For C there's no end part: the scope is set by curly braces, or, if it's a one-line statement, you don't even need those. So this is how you would form a directive with OpenACC or a directive with OpenMP.
This slide I borrowed from the OpenACC training materials, to show how many gangs are created with parallel, just to give you a visual picture. Without a loop directive, everybody does the whole loop. Once you add #pragma acc loop, the loop is distributed, and then, without further clauses, it's at the gang level, with worker and vector underneath.
For OpenMP, there are three levels: teams, parallel for, and simd. But look at the list up there: a bunch of compilers basically use only two levels, teams and parallel, and ignore simd. The Cray compilers used to ignore parallel for, for CCE 8, but with CCE 9 the C compiler is now Clang/LLVM based, so it follows the LLVM approach.
Now it uses teams and parallel, but CCE 9 Fortran is still the classic Cray Fortran front end and ignores simd. And then Intel, and LLVM Clang 11, which is under development, will try to do all three levels. So that's what we say: write omp target teams distribute parallel for simd as the whole thing. There are some caveats: some of your algorithms might not fit so well, so you have to separate them, or sometimes you want to collapse; there are things in between.
I just want to list some of the hardware and software mappings. If you're familiar with, say, CUDA, OpenCL, or the hardware, you know what these are. At least for OpenACC and OpenMP, you can say gang is teams, worker is thread (or we could say parallel for in OpenMP), and vector is simd. And in CUDA, you will have heard of thread blocks, threads, and warps. So, as we said, we recommend you just use acc loop, or the combined OpenMP construct, and let the compiler do it.
So now I'm going to walk through a Laplace equation example that the NVIDIA people provided with OpenACC. I just want to show you, with OpenACC and with OpenMP, and with all the different compilers, how we can solve this problem. It's not an extensive optimization: there are more things you can do (you can tile, or collapse loops; all of that is not being applied), just this level, with a data region being considered. But let's start from the beginning. All these codes are also in the hands-on session.
So this is the C code. Let me first say its basic physical meaning: basically, you have a grid, and you iterate; each point in the middle of the grid, at the next time step, is the average of its four neighbors. Then you have the new whole grid, and you do your error checking to compare how much this one has converged relative to the last time step, until either you reach your maximum steps or you reach the convergence criteria.
Then the problem is solved. So the source code is: while your error is bigger than the tolerance you have to continue, or while your iteration count is still less than your maximum you continue. Inside, you do the calculation, then the error checking, and then you do a swap: this time step becomes the old one, and then you calculate the new one again.
Okay, before we go on, I want to introduce two common clauses we often use in OpenMP; they also apply to OpenACC. Then we can directly apply them to this example without worrying about more complexities. One is reduction. One example, on the right side, is: if you compute an average in a loop, you basically add the values together, basically doing a summation.
But this is not parallelizable without reduction, because there is a loop-carried dependency. So we have this clause, reduction, and basically each thread holds its own local sum, and then the implementation adds them together. So we want to use reduction here; I'll show why we need it in this application. And then collapse: if you have multiple loops, especially if your outer loop is too small, then with collapse:
You can have a much bigger iteration space to work with, which can be distributed among the threads. So we did that. Now we go back to this application. You can see that the reduction is needed here, because of this max on the error: every time you get a new error, you do a max with the previous value, just like a reduction with the max operator. So we do that reduction; and then, as in the original example, you have two levels of acc loop.
So this is the one we call the parallel implementation, without anything else like data clauses yet. And let me introduce one concept: CUDA has managed memory. This allows the compiler to help you manage data as if the host and the device shared the same memory space, so that you don't have to transfer it yourself; the compiler does it, and it can save you lots of data transfers back and forth. That's the story for OpenACC.
You treat your data as if it were in the same physical space. For OpenMP 5: first, you need your hardware to support this (you can check with nvidia-smi or similar that it's supported); then the implementation has to actually support this feature, called requires unified_shared_memory, and you just add that. I don't think it's widely available yet, but this is going to make things much easier next time, when it's on the market. So now, let's try with this managed memory.
So the compiler was PGI's pgcc, and we wanted some optimization, at the -fast level, with -ta=tesla:cc70 plus the managed option, and -Minfo gives you lots of output to tell you how it is going to generate the accelerator code: whether it is parallelizing, where it is using vectorization, what block size and thread size it is using. You get all of these. However, with this parallel implementation, compilation immediately fails with the C compiler.
The reason is that in C the arrays are dynamically allocated, and the compiler doesn't know the size. If I use the Fortran implementation instead, it compiles. So now we do it with managed: the size doesn't matter anymore, it doesn't need to know it, and it runs, and it's not bad; the memory copy time is almost zero for this. If you use managed, there's no data that needs to be copied manually.
So here are a few of the data clauses we now introduce, because without managed memory we need to add data clauses to the parallel construct. So what are they? Putting OpenACC and OpenMP together: OpenACC calls it copy, and OpenMP calls it map, with to and from. The next slide is a table that shows how they are equivalent. In C/C++ an array section is starting index and length, while in Fortran it is starting and ending indexes.
Okay, so this is the table I showed. The copy clause is the equivalent of map(tofrom) in OpenMP; tofrom is actually the default, but you can do just to or just from. To means from host to device; from means from device to host. Then copyin and copyout map to map(to) and map(from), and create in OpenACC is alloc in OpenMP, so you can allocate on the GPU, but without a copy: it's treated as temporary data on the GPU. And there's present:
You can check if the data is already there, and then you can save some time. Okay, so now we add the data clauses, because the data is allocated on the host and you're using it on the GPU (since you wrote #pragma acc parallel), so you need it to be on the device. You use copyin for an array where you just give the device the initial data, and when it's done you want to get the result back to the host.
So you want copy (in and out) for one array, and copyin for the other, because we have our initialization data there. Like I said, we added the reduction and the collapse. So now we have the data clauses. But remember, we also have a bigger outer loop of iterations, the big while loop, so this moves the data every time through that loop.
And yes, it's very, very bad: about 200 seconds, where before, with the managed data, it was about one second. The reason: when you compile, it says it is going to generate this copy if the data is not already present, and at runtime it actually does. This is the output when you run it with the accelerator timing turned on: you get the data transfer times. There's also nvprof you can run, which is the NVIDIA-provided profiler that you're going to hear more about later.
Basically, with that you get some output that shows how much time was spent and how many times each operation is called: over 33,000 calls to do this CUDA memcpy. HtoD means host to device; DtoH means device to host. Basically, for each iteration it started a new parallel region, offloaded everything, copied, and repeated.
So let's, like I mentioned earlier, keep the data on the device as long as possible. Let's not move it every time in the region; let's do it outside of this while loop. Your data is reused every time step, again and again; there's no need to move it in and out. So there's the acc data directive for that, and you can do a target data region with map in OpenMP as well.
Okay, so this is the OpenMP one, and I use Fortran, because for Fortran I can actually use the Cray compiler, which is faster. Otherwise, if I showed you the Clang numbers, you would see three seconds, but with the Cray compiler it's really fast. So here is the Fortran one, with OpenMP, and also with the data region outside of the while loop.
A
So
we
can
do
better
or
not
exactly,
but
it
depends
on
what
your
application
are.
It's
if
the
the
single,
the
structured
data
directly
requires
you
that
enter
data
and
int
and
and
data
region
enter
for
target
enter
to
exit.
You
need
between
the
same
function.
Call
sometimes
you
know
big
application.
It's
really
hard
to
do
so.
You
unstructured
data
region
so
that
you
can
have
enter
and
exit
to
multiple
times
different
places,
for
example
for
this.
I mentioned the differences: with structured, you have to express the start and end within a single function; with unstructured, you can have multiple start and end points and can branch across multiple functions. Okay, now back to this example code: you now have an enter data copyin in your initialization function, and then an exit data in your deallocate function.
Otherwise, this is the OpenACC one with PGI: there's a regular version and a managed version. The regular version you could probably optimize more if you wanted to, but up to now this one is not fully optimized: you can do more with each of them, with async, nowait, and all these other things, but that's not in there yet; I didn't do that. It's just a data point up to this level. Any questions?
Okay, going on: I want to mention something that doesn't exist in OpenMP. This is acc kernels. The kernels construct is: you can just say, hey, I want this hotspot region to be on the target device, and I just put acc kernels there, without doing anything else. You may need to worry about some data first, but otherwise even the data movement is implicit, as long as the compiler can manage to work out the data for you. Otherwise, if it's not safe, it won't; and actually, for the C code, I got a runtime error.
So basically, with kernels, even if there are multiple loops in the region, the compiler will try to generate multiple kernels for it, because it treats this as a whole region and will do whatever is safe, whatever it can optimize for you. Parallel is basically the more explicit way: the programmer tells the compiler, hey, I want you to do the offload and the parallelization, and here are the loops I want you to work on. You do a lot of it manually. If you don't put a loop directive, sometimes some compilers will parallelize the loop for you anyway.
Sometimes I even found some compilers adding the reduction for you if you don't say it, but that's not something to rely on: you should always add it as needed. If you run it with another compiler, it will fail for you because you forgot to do that. With kernels, your correctness is guaranteed, but usually kernels is not as performant as the hand-tuned versions. So, as a start, it's an easy way to go.
Okay, so that's the example. I didn't use the update directive in that example, but sometimes in your code you might do something on the device, and then you want another device region, but in between, on the host, you want to do some exchange. That's what update is for: there's self and device for OpenACC, and from and to for OpenMP, meaning whichever direction is needed.
Now let's shift gears. Having introduced all these things, let's look at what OpenACC is, what OpenMP is, how their communities are, what compilers are available, and all these other things. I think what I showed is that they're actually pretty similar in how you add them to your code, right? Adding the directives is not hard; it's more about where to add them and what to add.
OpenMP has a big committee behind it, and OpenMP supports lots of compilers and lots of architectures as well. For OpenACC: I think when Titan was around, it started because we wanted something quicker that could perform well on the GPU. So OpenACC was introduced in 2011, and then the pattern was: OpenACC would get features first, and then OpenMP would say, hey, this is good, let's take it.
So here are the features: 4.0 started to do GPU targets; 4.5 was a major refinement with more target features; and 5.0 has more features, like the aforementioned loop construct and unified memory, and these things. So OpenMP is getting more and more mature as well, but OpenACC is still the more mature GPU programming model at this point, especially for the NVIDIA GPUs.
And here's the OpenACC resources page. Compiler-wise, on Perlmutter we will have PGI, and GCC (though GCC performance is not that good); Cray deprecated its OpenACC support in recent CCE versions. On the other DOE systems it's probably still those, because those are the bigger commercial and non-commercial OpenACC compilers available.
For OpenMP, again, we do have our web page list; the list is very long (only about one third of it is shown here). For Perlmutter it will be PGI because, as I think Jack mentioned, we have an NRE with PGI to develop the OpenMP offload support for the Perlmutter GPUs. The timeline is that when Perlmutter is here, we will have an officially released compiler version. So we will have PGI, and we expect that PGI will leverage their OpenACC implementation expertise.
So this should be a good compiler. CCE is now focused on OpenMP rather than OpenACC, and Clang is also only for OpenMP, not for OpenACC; GCC (I put GCC twice); and Flang is also part of the big LLVM community development effort, involving, I think, NVIDIA/PGI, Cray, and Intel. And then for the other DOE ecosystems, it's basically the same, but I added IBM and AMD; these are for the other DOE labs' bigger systems.
Summit and Frontier here at Oak Ridge, and Aurora at Argonne with Intel; and IBM is the path for Summit. So there are more OpenMP compilers available. This slide's concept is from the recent DOE OpenMP BOF: we need to make sure we have a portable solution that can target different platforms across vendors. So here's a list, timeline-wise, of compilers that support OpenMP; I mentioned them all earlier.
Here I want to show you a conversion list; it's actually almost a one-to-one conversion. So at this point I'd say conversion is not hard if you already have OpenACC; it's more a question of how your code maps to which compiler and which implementation scheme, and whether the maturity of that compiler gives you good performance. I don't need to read it all: basically, gang, worker, and vector map to distribute, parallel for, and simd, and then there are the data clauses and the rest.
Okay, I showed this. Now another shift of gears: I want to show you some of the existing community efforts, people comparing and porting to OpenMP and the performance they're seeing. I think the biggest outcome from this is that people are discovering lots of compiler bugs in OpenMP implementations, or finding things missing that they wanted to see. They then ask the OpenMP community and the implementations to fix them, or to include them in the next specification; these kinds of feature requests.
This is one of the comparisons shown by the Oak Ridge people. They took actual major benchmarks and compared OpenACC and OpenMP. The green ones are where OpenMP is better; the blue ones are where OpenACC is better. These are results from 2018, so it's not the most up to date, but still, at that point, with the then-current versions of OpenMP 4.5 compilers.
I think I already mentioned this loop construct. It brings OpenMP closer to OpenACC, but it also makes sure your loop iterations are executed exactly once: not like some other combinations where, in a nested loop, an iteration might actually run more than one time, or at least once; with this one it is exactly once.
For the last few slides, I just want to talk about best practices and some recommendations for OpenACC and OpenMP offload; basically, these apply to both of them. You want to give the GPU enough work to do, and the compiler feedback can give you lots of information to find out what is missing and why things are not parallelized or vectorized; all that information helps you. Several of the compilers give good hints.
Collapse is good, because you can enlarge the iteration space; and there are the clauses for OpenACC such as vector length, for example 32 to match a warp of 32. And lastly, try out different compilers: I was surprised to see the CCE compiler performing so well for that code. Try different compilers, and, if it's not hard to do, you can try both OpenMP and OpenACC as well.
So if your code already has OpenACC, especially coming from the Oak Ridge side, it's fine: Perlmutter has PGI OpenACC, and you can continue to use it; maybe at some point, when you want to port it to other DOE machines, you can probably transfer it to OpenMP. But if you're starting from new, if you're a new NERSC user, then just for continuity purposes OpenMP is probably good: we have already been using OpenMP at NERSC for a long time, and with the PGI OpenMP offload coming along.
It's natural that you could continue. Performance-wise, from what we looked at, it's on par with or catching up to OpenACC, depending on which compiler implementation you're looking at: IBM seems to do pretty well, and CCE in some cases as well, I think. But all these OpenMP compiler implementations are actively evolving, fixing things, and starting to implement the 5.0 standard, so you will see more and more higher-quality versions in the future.