From YouTube: Programming platforms: Kokkos, RAJA and OpenACC

Description
Rahul Gayatri (LBNL), Sunita Chandrasekaran (University of Delaware) and David Alexander Beckingsale (LLNL) present a panel discussion on programming platforms: Kokkos, RAJA and OpenACC. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/ Panel Chair: Dossay Oryspayev
Dossay Oryspayev (moderator): We'll have two main parts. The first part will be presentations: short presentations of about five to seven minutes by each of our panelists. Then we'll have slightly more than 30 minutes for a discussion and Q&A session, and I encourage all of the audience to actively ask your questions and participate in the discussion. You can do this either using the Q&A box, or you can raise your hand and I will unmute your mic so that you can ask your question directly to one of our panelists, or to all of them.

Okay, that being said, I would like to first introduce Rahul Kumar Gayatri. He is an application performance specialist in the APG group at NERSC, where he works mainly on helping application teams optimize their codes for next-generation architectures. He was a postdoc in the same group prior to joining the staff, and did his graduate work at the Barcelona Supercomputing Center, where he worked with the OmpSs programming model group. Okay, Rahul, the floor is all yours.
Rahul Gayatri: Thanks, Dossay. Hello, everyone. Can everybody see my screen?

Dossay Oryspayev: Yes, Rahul.

Rahul Gayatri: Okay, thank you. Hi. So, as requested earlier, today I'll be talking about OpenMP, that is, OpenMP for GPUs in the context of this workshop, and the Kokkos programming model.
Let me start with a brief introduction to what I want to talk about. OpenMP: I'm assuming that most of us are pretty familiar with OpenMP as a framework. If you want to know more about OpenMP for GPUs, I would encourage you to check out Chris Daley's talk from yesterday's performance, productivity and portability panel.
I think he gave an excellent summary of the available OpenMP offload features, the compilers that currently support them and what their status is, and their performance on the different benchmarks and micro-benchmarks that he tested.
There are the NVIDIA GPUs that will be there whenever Perlmutter is ready for production; then there are the Intel GPUs that will be available on Aurora, for which the Intel compilers plan to support the OpenMP offload features; and the AMD GPUs on Frontier, for which both the Cray and AMD compilers plan to support these features.
So, in a sense, we can think of OpenMP as this portable framework: you can take an existing code written with the offload directives and run it across all the next-generation supercomputers that will be in the DOE space.
I know the title says performance portability, and I don't want to get into that discussion, because, as we saw in yesterday's panel, there are multiple viewpoints and there is no single accepted definition of what performance portability is as yet. But at least you can be sure that there will be some sort of portability when you use these OpenMP offload directives across these multiple architectures. How performant they will be, and how close to the peak performance of each architecture they will get, that's yet to be seen.
But the good thing about this, in my experience working with these directives and with different compilers, is that the compiler developers are really receptive to any bugs you might hit or any feature requests you may have.
They have open forums where you can submit these bugs, and most of the time it is actually something the programmer is doing wrong rather than the compiler. But when there is a genuine compiler bug, they are very diligent in fixing it and releasing the fix with the next compiler release.
So in that sense it's a very active community, in terms of users and in terms of compiler developers.

As for the advantages and disadvantages of using OpenMP offload directives: the first thing, as we all know, is that OpenMP is relatively easy to use. It's not too invasive in your code; you can just annotate a block of code, saying that the subsequent loop has to run in parallel, and when these annotated directives have the target keyword in them, OpenMP will offload them to whatever target accelerator is available. It also has support for the C, C++ and Fortran languages, which are quite widely used in our community. And there is active work on having some sort of implementation of these offload directives on the CPUs, in case there is no accelerator available where the code is being run. So there is portability; at least, that's the plan going forward.
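(For reference, and not from the talk itself: a minimal sketch of the kind of annotated loop being described, offloaded with OpenMP target directives. The function and array names here are illustrative.)

    // OpenMP offload sketch: the "target" keyword sends the annotated
    // loop to whatever accelerator is available.
    void saxpy(int n, float a, const float* x, float* y) {
      // map(...) describes the data movement; "teams distribute parallel
      // for" exposes the loop's parallelism to the device.
      #pragma omp target teams distribute parallel for \
          map(to: x[0:n]) map(tofrom: y[0:n])
      for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
    }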
The biggest drawback that I have seen, as Chris also pointed out yesterday, is that you have to really simplify the code when passing it to the OpenMP compilers to get optimal performance. This is especially true in C++, where even with some sort of advanced templated metaprogramming, which is supported by C++11 (and the current spec says that C++11 is the supported base language), you still don't see results as performant as you would if you had really hand-tuned each of the template parameters. And it requires most of the best GPU programming practices, like doing column-major access for coalesced memory access, to be done by hand, rather than depending on the framework to get this for you.
In that sense, Kokkos is a much more advanced framework: it allows you to have fine-grained control over the code, while also doing a lot of these performance tricks in the backend by itself. So, for those of you who do not know what Kokkos is: it is a C++-based programming model for writing performance-portable applications.
It allows you to expose abstract hierarchies of parallelism in your code, and then it is the job of the framework to map these hierarchies onto whatever target architecture you are running on. It provides some sort of portability across all major HPC platforms, and the way it does this is by supporting different backends. The backends that are already available are the serial backend, which is basically sequential code; the Pthreads backend; the OpenMP 3.0 backend; the CUDA backend; and the CUDA UVM backend.
The fact that there are already OpenMP 3 and Pthreads backends implies that Kokkos code will already run on most of the CPUs available for HPC, and the CUDA and CUDA UVM backends imply that you can run Kokkos code on all the NVIDIA GPUs that are available. And apart from these backends, there's active development of an OpenMP target backend.
What are some of its highlights? As I mentioned earlier, it allows you to abstract away the execution and memory spaces: it will let you allocate a particular piece of storage in a particular memory space, and execute a given block of code in a particular execution space.
The way it abstracts the memory space is with these View classes, which are basically an abstraction for multi-dimensional arrays. That lets you choose your memory layout, that is, what type of data layout you want, whether you want row-major or column-major storage, and on which memory space you want that default storage to happen.
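(For reference: a minimal sketch of the View abstraction being described; the layout and memory space are template parameters, and the names and extents here are illustrative.)

    // A Kokkos View is a multi-dimensional array whose layout and
    // memory space are template parameters, so the same code can use
    // row-major host arrays or column-major device arrays.
    #include <Kokkos_Core.hpp>

    // 100x50 matrix, column-major (LayoutLeft), allocated in the
    // default memory space of the default execution space.
    Kokkos::View<double**, Kokkos::LayoutLeft> A("A", 100, 50);

    // Same shape, but row-major (LayoutRight) and explicitly placed
    // in host memory.
    Kokkos::View<double**, Kokkos::LayoutRight, Kokkos::HostSpace>
        A_host("A_host", 100, 50);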
How much time do I have, one minute? Oh, okay, okay. And then it lets you do 1D or multi-dimensional loop computations, where it lets you tile the loops, and it divides the available parallelism into three different hierarchies, which you can think of as analogous to CUDA: teams map to thread blocks, threads to threadIdx.y, and vector to threadIdx.x. And it allows you to do atomic operations.
This is a simple example of how a simple C++ code can be written in Kokkos. On the left side you can see the C++ code, where we have an integer array of ten elements being updated inside a for loop. On the Kokkos side, you can actually choose your execution space, or, you know, use the Kokkos default execution space.
This code will then run without any change with the OpenMP 3 backend on CPUs, or the CUDA backend on GPUs, and the HIP backend whenever it comes, and the SYCL backend whenever we start working on it. So you don't need to do anything else for portability.
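(For reference: a minimal sketch of the kind of side-by-side shown on the slide, a ten-element update written as a plain C++ loop and as a Kokkos parallel_for; the variable names are illustrative.)

    #include <Kokkos_Core.hpp>

    int main(int argc, char* argv[]) {
      Kokkos::initialize(argc, argv);
      {
        // The plain C++ version would be:
        //   int a[10];
        //   for (int i = 0; i < 10; ++i) a[i] = 2 * i;

        // Kokkos version: the same loop, dispatched to the default
        // execution space (OpenMP, CUDA, HIP, ...) chosen at build time.
        Kokkos::View<int*> a("a", 10);
        Kokkos::parallel_for("update", 10, KOKKOS_LAMBDA(const int i) {
          a(i) = 2 * i;
        });
      }
      Kokkos::finalize();
    }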
Dossay Oryspayev: Thank you, Rahul. Our next talk will be given by Sunita Chandrasekaran. She is an assistant professor in the Department of Computer and Information Sciences at the University of Delaware. She received her PhD in 2012, on tools and algorithms for high-level algorithm mapping to FPGAs, from the School of Computer Science and Engineering at NTU Singapore. Her research spans high performance computing, parallel programming, benchmarking and data science; applications of interest include scientific domains such as plasma physics, biophysics, solar physics and bioinformatics. Sunita?
Sunita Chandrasekaran: Thank you, Dossay. I hope you all can see the slides. Okay, thank you, everybody, for joining, and for, you know, being able to do this online.
So, following up on Rahul's previous talk on OpenMP and Kokkos, this is about OpenACC. The idea, again: it's a directive-based programming model, one of the two directive-based programming models, the other one being OpenMP. I'll just be skimming through some of the ongoing things that we have been up to with OpenACC.
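(For reference: a minimal sketch of what an OpenACC-annotated loop looks like, the same directive-based approach as OpenMP offload; the loop itself is illustrative, not from the slides.)

    // OpenACC sketch: one directive asks the compiler to offload the
    // loop; copyin/copyout describe the data movement.
    void scale(int n, float a, const float* x, float* y) {
      #pragma acc parallel loop copyin(x[0:n]) copyout(y[0:n])
      for (int i = 0; i < n; ++i)
        y[i] = a * x[i];
    }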
This was the OpenACC 3.0 specification, which was announced at SC last year, and there is a link on the slide that will take you to more elaborate updates on the new features added in 3.0. We have started to work closely with the base languages to be able to support some of their important features, and there is still an ongoing conversation within the OpenACC technical committee about updating the base languages to C18, C++17 and Fortran 2018, with the motivation of defining behavior for C++ lambdas, which were added in C++14.
There are definitely many more things to do with respect to supporting other features in these languages, so this is pretty much a start. We also improved multi-device support, through direct memory copies and synchronization, as another added feature. Prior to this, if you wanted to copy data from one GPU to another, you had to copy to the CPU and then to the other GPU. By enabling this improved multi-device support, we won't need to synchronize back and forth with the CPU, and we won't need to block the CPU twice.
Similarly: a zero modifier on the create data clause, an expanded list of directives that can support the if clause, and some more clarifications and cleanup based on user feedback as we go through the specification. As all of you might have experienced, you always want to update and clean up based on the feedback you have received while developing the features as well as using them.
I also wanted to step through some of the activities with respect to using OpenACC in scientific applications. Some of these data are not super current, but, for example, 18 of the INCITE applications at Summit use OpenACC (that data is from November 2019), and the top-five HPC applications figure from Intersect360 Research is a couple of years old now, but Gaussian, VASP and ANSYS Fluent are some of those top three-to-five HPC applications.
As for platforms supported, there is a list there, but I would also plus-one Rahul's previous comment: we could get into the performance portability aspect and talk about it for several hours, and we would never come to a consensus. So those are the targets that OpenACC currently supports. And for OpenACC applications, as you can see, the trend has grown from about 30 applications all the way through to above 200, and the applications worked on at hackathons obviously also count toward the increase in the number of different types of domain-science applications that OpenACC is able to support.
We are also running an OpenACC Slack channel, which has grown quite a bit over the past couple of years, especially among those participating in GPU hackathons. We invite them to the Slack channel, which they basically use as a Stack Overflow, if you like. We have been debating between keeping Slack and moving to Stack Overflow, but the bottom line is trying to answer questions as and when the users have them, with an easy mode of communication to get them up to speed.
Ever since this was released and made available, you can see the total number of downloads has steadily increased, and I myself use it as part of my teaching, in parallel computing or computer architecture courses, just like you would use, say, GCC OpenMP for parallel computing classes.
This is an ongoing effort. A bit more on the GPU hackathons: there are a couple of varieties that any of you could participate in, boot camps and hackathons. One is a couple-of-days event, the other one is a five-day event. You basically bring your code and bring your team, there are mentors assigned to your team (two mentors per team), and you just sit down, hack code, and get it working on the systems that the particular hackathon host supports.
This has been very successful, and you can see from the plots how the numbers of participants and of codes have been steadily increasing. I myself did one of them at UD, back in 2016; we had six teams, and the codes definitely took off. It was a nice way of exposing domain scientists to the different ways you can program, you know, accelerate and parallelize their codes on large-scale systems. And thanks to Julia, we got this slide:
I got it from her literally just yesterday, showing the bunch of hackathons coming up. So take a look at gpuhackathons.org and you will find a range of applications, and it's not just OpenACC: it's CUDA, OpenACC, OpenMP, Kokkos. There are all kinds of applications, sometimes even Python.
So if you're interested, you have a code, and you're looking for help and mentorship to move your code to large-scale systems, you have more than one place to go to participate, and several of these hackathons are running virtually right now; I hear it is pretty successful. So do make use of that for your GPU porting. The last thing I wanted to draw your attention to is the OpenACC teaching kit, for people like me
who teach, and I'm sure there are several of you on the call who probably educate and teach, you know, the next-generation workforce. There's a bunch of teaching material that I worked with NVIDIA to put together, and there is Dockerization available as well, Google Slides, and some codes for lab exercises. These are some of the modules; there's room for improvement, but there's something for you to start off with and get your hands dirty.
Dossay Oryspayev: Thank you, Sunita, and thank you for staying within the time. Okay, our next programming-model talk will be given by David Alexander Beckingsale. He is a computer scientist in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. His work focuses on programming abstractions; he is the project lead for Umpire and CHAI and a core RAJA team member. David received his PhD in computer science from the University of Warwick, UK, in June 2015.
David Beckingsale: So if you're familiar with RAJA you'll know this, but it's a library of C++ abstractions that allows you to write single-source, portable loop kernels. The key idea here, and it's similar to what Kokkos does, is to really insulate the application source code from any hardware or programming-model details, and it gives you an extra layer of abstraction on top of some of the other kind of portable programming models, like OpenMP. RAJA supports a wide range of application needs.
In terms of backends, we've made really good progress, pretty much supporting all the current platforms, as well as being well underway to supporting the machines that are coming up soon, which you'll be expected to run on. So we have regular sequential loops, some SIMD stuff, and OpenMP support both for the CPU and for target offload with OpenMP 4.5. We also have a partial Threading Building Blocks backend.
The initial goal was to get these big old codes running efficiently on Sierra, and at the time this porting effort was started, a complete CUDA rewrite was never going to fly, because it's not portable, and when you have a million lines of code, you just can't afford to maintain multiple versions. And when these efforts were started, OpenMP offload wasn't really viable. I think this was mentioned before, but you're really heavily dependent on compiler support if you want to use something like OpenMP, and so that wasn't a route our application customers really wanted to go down. And then, if you look out at the future platforms coming down the pipeline, they are based on GPUs from different vendors, so having some abstraction that insulates you from this changing technology is critical.
So, in terms of what RAJA looks like: at the top here we've got just your standard C loop, and we introduce a few concepts that allow you to write it in a portable way. The first thing is the execution template; here we have forall, which is our simple loop API. That's templated on an execution policy, which determines where this code is going to be executed. Then, instead of passing in your loop bounds as just a beginning and an end, we provide these iteration-space objects that allow you to describe what you're going to be iterating over; in this case, that's the RangeSegment, which is just a contiguous range of indices. And the final piece is that, instead of writing your loop body just in there,
you turn it into a lambda expression. The numbers here in the bottom right are just comparing the speedups of this loop: you can write it natively in sequential, OpenMP or CUDA, or use the RAJA version, where the only thing that changes is the template parameter. And you can see there, taking the time of the native implementation over the time of the RAJA one, we're pretty close to parity across those backends.
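(For reference: a minimal sketch of the forall pattern being described; the policy shown is one of several, and the loop body and names are illustrative.)

    #include "RAJA/RAJA.hpp"

    // RAJA::forall: execution policy as a template parameter, a
    // RangeSegment as the iteration space, and the body as a lambda.
    void scale(int n, double a, const double* x, double* y) {
      RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, n),
        [=](int i) {
          y[i] = a * x[i];
        });
      // Changing RAJA::seq_exec to RAJA::omp_parallel_for_exec retargets
      // the loop to OpenMP; a device policy such as RAJA::cuda_exec<256>
      // additionally needs a RAJA_DEVICE-annotated lambda.
    }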
The next API is the kernel API, and this is how we describe the more complex cases, multiple levels of loops, as well as non-tightly-nested ones. It's parameterized in kind of the same way:
you have the kernel function, templated on an execution policy, but instead of a single iteration space and a single lambda function, you can pass in an arbitrary number of iteration spaces and an arbitrary number of lambda functions, and then the execution policy describes how those are iterated over and the order in which you want things done. So with this we support things like loop tiling.
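(For reference: a minimal sketch of the kernel API for a simple nested loop; the policy here is a sequentially-nested one, and the function and array names are illustrative.)

    #include "RAJA/RAJA.hpp"

    // RAJA::kernel: several iteration spaces, a policy describing how
    // the loop nest is traversed, and one or more lambdas.
    void add2d(int ni, int nj, const double* A, const double* B, double* C) {
      using Pol = RAJA::KernelPolicy<
        RAJA::statement::For<1, RAJA::seq_exec,      // outer loop over j
          RAJA::statement::For<0, RAJA::seq_exec,    // inner loop over i
            RAJA::statement::Lambda<0>>>>;

      RAJA::kernel<Pol>(
        RAJA::make_tuple(RAJA::RangeSegment(0, ni),
                         RAJA::RangeSegment(0, nj)),
        [=](int i, int j) {
          C[j * ni + i] = A[j * ni + i] + B[j * ni + i];
        });
    }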
In the previous code examples, the policies just determine where that loop is going to run, and data accessibility is the responsibility of the programmer. Again, this goes back to the initial work with the code teams at Livermore, where they already had code to manage their data in the way they wanted to, and we weren't going to come in and tell them that they had to move everything to some special RAJA-managed data type.
The two projects that I work on in this space are CHAI, which provides an array-like object that coordinates with RAJA so that the data moves back and forth between the CPU and GPU implicitly, depending on where your RAJA loop is going to run; and then we also have the Umpire project, which is a portable API for accessing different types of memory resources.
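(For reference: a minimal sketch of the CHAI idea, a ManagedArray that migrates based on where the RAJA loop executes. This assumes a RAJA build with the CHAI plugin enabled, and the names are illustrative.)

    #include "RAJA/RAJA.hpp"
    #include "chai/ManagedArray.hpp"

    // CHAI's ManagedArray copies itself to the right memory space when
    // it is captured by a RAJA loop, so no explicit copies are needed.
    void demo(int n) {
      chai::ManagedArray<double> v(n);

      // Host execution: v is touched in CPU memory.
      RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, n),
        [=](int i) { v[i] = 1.0; });

      // Device execution: v is copied to GPU memory automatically.
      RAJA::forall<RAJA::cuda_exec<256>>(RAJA::RangeSegment(0, n),
        [=] RAJA_DEVICE (int i) { v[i] *= 2.0; });

      v.free();
    }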
So really, what RAJA has given us, in having an abstraction on top of these various other programming-model technologies, is the ability to write code that's high performance on Sierra, but that we know is going to be portable to future platforms. We're insulating all the application developers, the computational physicists, from all this kind of underlying churn in programming models, you know, where you've got CUDA and HIP and SYCL.
That's really not something that you want the application developers to have to deal with. And one of the things that we've really focused on is making it easy to do the most common use cases: incrementally adopting this, one loop at a time as you gradually port your application, is easy, and that was a critical feature in moving applications onto Sierra through some of our APIs.
I know that historically people have viewed the RAJA project as fairly Livermore-focused, but we really are making an effort to collaborate, to continue to develop in the open, and to onboard more external users. So I'll just close by saying that we really welcome users and collaborations, and we hope to hear from some of you. Thanks very much.
Dossay Oryspayev: Thank you, David. I'd like to thank all of our panelists, on behalf of everyone, for presenting these very nice overviews of the different programming models, and for being able to join us during these uneasy times. With that, I'd like to start taking questions from the attendees; and panelists, please also feel free to discuss topics with the other panelists.
Do we have any questions? Okay, it looks like we have our first question. It is coming from one of our organizers, Muaaz Awan. He is asking: in this soup of languages, where each claims ease of use and portability, someone who is new to GPU porting might get overwhelmed.
David Beckingsale: I would say that all of us are probably going to suggest the projects that we're affiliated with. But I think you've kind of seen, across these projects, that the important thing is that you pick something that is going to be portable. And then the second thing to consider would be what fits best with the application that you want to port, because if it's Fortran, then Kokkos and RAJA are kind of ruled out.
Sunita Chandrasekaran: Sunita here, chiming in; plus one to what David just said. Coming from the standpoint of, you know, literally teaching students in class about GPU porting: my other question to the person who is posing this question would be, what is the person's background? Are we talking about somebody who is a beginner at GPU porting, or somebody who's at an advanced stage, or an intermediate stage? Where you want to start will also depend on that.
I would rather throw a directive at them than CUDA, for example, and I can see them starting to use GPUs at the end of the three months of a semester. And they did projects: we did very basic OpenMP offloading, and they did projects with OpenACC on GPUs, and they were happy to have used GPUs. So I think it would also depend on what kind of background the person is coming from, in order to choose a particular framework to begin GPU porting with.
Rahul Gayatri: Hi, this is Rahul. I agree with what Sunita and David said. The first thing is: what is the base language that your code is in? And the second thing is: how much time do you have to spend on this? Kokkos and RAJA are both a bit more intensive, in the sense that you will have to spend a slightly longer time
just to get your code ready to start using these frameworks, compared to something like OpenACC or OpenMP, which are a bit easier initially. But then again, Kokkos and RAJA support more backends than OpenACC or OpenMP. So that's the choice: how much time do you have to work on this?
Dossay Oryspayev: Okay, yeah, thank you for answering this question. We have several questions in the queue, so let's get started with the easy ones. Vincent says that he is sorry he missed some of the last talk: what hardware backends can RAJA target? This is, I believe, for David.
David Beckingsale: Yeah, I got it, okay. So we have sequential, which is, you know, just your standard loops; we have a way to kind of force the compiler to generate SIMD code; then we have OpenMP on the CPU, and on the target with OpenMP 4.5 offload. We have a Threading Building Blocks backend that has partial support; we have full support for CUDA and for HIP; and we have a development backend for SYCL, targeting the Intel GPUs.
Dossay Oryspayev: Okay, thank you. We have several other questions lined up, so the quickest one to answer would be: what is the main difference between OpenACC and OpenMP? An anonymous attendee is asking.
Sunita Chandrasekaran: So, the OpenACC and OpenMP difference. That's an excellent question.
Rahul Gayatri: Okay, apart from the obvious fact that, until a couple of years back, OpenMP was concentrated more on the CPU side and OpenACC more on the GPU side: now, with the rapid development of these target directives by different compiler vendors for different hardware, I would say that, at least for NVIDIA GPUs, OpenMP and OpenACC are pretty much similar. But OpenACC, especially the PGI implementation,
has been around for a longer time, so in some cases you might find that it is more optimized in its implementation compared to OpenMP, because OpenMP offload is still a bit new in that respect. But from my experience, both of them can be used to achieve the same sort of performance.
Sunita Chandrasekaran: Thank you, Rahul; I thought I should let you say it first. Obviously OpenMP has been around for a much longer time than OpenACC, and OpenMP has been, you know, prevalent on CPUs for many years, so there are concepts like tasks in OpenMP where you could probably do the same with OpenACC, but it's a little bit more convoluted.
So if you have an application you want to break down with respect to tasks, I would use OpenMP. With respect to GPU programming: OpenMP offloading compilers are evolving as we speak, we know they are a priority for many different reasons, and there are codes beginning to exist; when I say codes, I mean more than benchmarks, I mean real codes. That's in comparison to OpenACC, which has been around since 2011-2012 onwards,
when implementations began to exist predominantly targeting GPUs, and so the adoption and usability of OpenACC features and implementations for GPUs are more readily available, for large codes and for production codes as well. OpenMP is playing catch-up, and I'm pretty sure they'll get there. But if you want to move your tens of thousands of lines of code to the GPU, OpenACC has been there, done that. So that's my two cents.
Dossay Oryspayev: Okay, thank you. The next question that we'll take is: how do Kokkos and RAJA compare and differ? They seem very similar.
David Beckingsale: I think some of the main differences are, first, kind of philosophical. Like I was getting at in my slides, we tried to make the simple thing simple: it's easy to put RAJA on just one loop, and you don't have to change any of your data structures or your memory-management code. We don't want to take control of that. And I think the other place where there are some differences is in some of the specific features.
So I think the stuff we have in the kernel API, in terms of what you can do with nested loop patterns and how you can map them specifically to various parts of the underlying programming-model backend, is kind of distinct to RAJA. And potentially, following on from that, the way that we express execution policies: it's not in terms of "this is a parallel loop"; it's "I want you to map this loop specifically to, you know, threads on the GPU, or blocks on the GPU", or something like that. So we really expose that control to the user through the interface.
Dossay Oryspayev: Okay, because of the time, let's continue with the next question, which is along similar lines. Jack is asking a question similar to ones that came up yesterday: do David and Rahul see a path towards getting the C++ standard to adopt some of the ideas in Kokkos and RAJA?
Rahul Gayatri: I would say yes. A lot of the advanced features that are available with Kokkos are actually being adopted into the C++23 standard. Sometimes you can imagine that Kokkos was a sort of testing bed for these features, and the things that actually work out well, the features that would actually be beneficial to the language standard as such, are being actively debated within the language committees to get them adopted into the C++ standards.
David Beckingsale: Yeah, and I would just add to that real quick: we have an ongoing collaboration between the RAJA and Kokkos teams to try to come up with some of these features. So the thing we're working on right now is portable atomics that would work across, you know, all these different hardware backends, but have the kind of semantics in the API of what will be in the C++ standard.
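(For reference: a minimal sketch of what a portable atomic looks like in RAJA today, the policy-tagged atomicAdd; the standard-aligned API David mentions was still in design at the time, and the array names here are illustrative.)

    // Portable atomic add inside a RAJA loop: the atomic policy is
    // resolved to the right hardware instruction for each backend.
    RAJA::forall<RAJA::omp_parallel_for_exec>(RAJA::RangeSegment(0, n),
      [=](int i) {
        RAJA::atomicAdd<RAJA::auto_atomic>(&hist[bin[i]], 1.0);
      });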
Rahul Gayatri: As far as I know, with Kokkos as of now... so, when you say asynchronous launching, do you mean launching a block of code and then just going ahead and doing other work, or what is it here?
Audience member: I mean, I think Kokkos kernel launches are independent, but on the other hand we have to manage the dependencies in some way, like events, or synchronized streams, or all those things. Is there a...
Rahul Gayatri: Yes, that was for a different purpose. You might be able to do this, but it's not exactly as straightforward as, I think, you are imagining with respect to this question.
David Beckingsale: Yeah, so we have some stuff in development right now, actually. It's taken us a while to figure out the API for this, but basically, yeah, you'll have some kind of portable object that represents your, you know, CUDA stream, for example, and we're currently figuring out what the kernels will return; it will probably be some kind of handle to an event, but it's generic. So it basically gives you what you say: you still have events, but it's not a CUDA event anymore.
Audience member: Does your effort coordinate CPU tasks as well? I mean, have a coherent environment handling both host- and device-side asynchrony, not just purely relying on the CUDA runtime to do the tasking on the device side? Because you not only have the GPU, you also have the host-side activities, as well as asynchronous I/O or asynchronous communication, right?
David Beckingsale: Yeah, I mean, one of the motivating use cases for what we're developing is your kind of communication loop, where you've got MPI stuff going on and you want to be dispatching messages as kernels are finishing. So it's not something that we have working right now, but it's certainly a use case that's driving this development.
Sunita Chandrasekaran: With OpenACC, we do offer async launching of kernels, and there are underlying ACC runtime APIs to be able to, you know, manage things under the hood; there are different types of runtime APIs that we have used.
I see Matt is on Zoom. Sorry to put you on the spot, Matt, but feel free, Mat Colgrove, to chime in if you want to add more to this.
Mat Colgrove: Sorry, I was only half paying attention, so I apologize; I didn't know I was going to be put on the spot. OpenACC does have the ability to do asynchronous execution: the compute kernels can be launched asynchronously to the host, or the data movement can be launched asynchronously, so the host continues as the data movement progresses. So it's all fairly...
you do have to add an extra clause to your directives, and by default it will block, but otherwise it's inherent in the programming model; the API allows for that.
Yes, some implementations will use a stream, if you're doing CUDA, to handle the dependencies, but it's really an async number, and you can apply dependencies via... you can actually do an async wait, where you do a wait, or rather an "acc wait", on one async queue from another, and you can create whole different dependency graphs based on that. So you could have one compute region which waits on async queues one and two, but is then launched on queue three.
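(For reference: a minimal sketch of the async/wait pattern being described, two independent regions on queues 1 and 2 and a third that waits on both; the loop bodies and the functions f and g are illustrative.)

    // Each async(n) clause places work on a numbered queue; the
    // wait(...) clause expresses dependencies between queues.
    #pragma acc parallel loop async(1)
    for (int i = 0; i < n; ++i) a[i] = f(i);

    #pragma acc parallel loop async(2)
    for (int i = 0; i < n; ++i) b[i] = g(i);

    // Launched on queue 3, but only after queues 1 and 2 finish.
    #pragma acc parallel loop async(3) wait(1, 2)
    for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];

    #pragma acc wait  // block the host until all queues drain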
Dossay Oryspayev: Thanks, that's nice! So let's continue. We have about eight minutes until the end of our session, and lots of good questions; some of the questions are open-ended, and I'd like to remind the audience that we have breakout rooms, and hopefully our panelists and the other attendees will be able to join the breakout rooms for further discussions.
But before we end this session, I'd like to ask a final question to all of our panelists; please feel free to take turns answering each of its parts, so that you give all the other panelists a chance to answer. My question is: what are the lessons that you, as a panelist, as an expert advocating a certain programming model, have learned from the history of the race of shared-memory programming models?
That's the first part. To give you a hint: there is a learning curve; there are academia and the community at large that take up adoption, and not least the support that comes with it; and finally there are the domain scientists, who also need to adopt those programming models. For example, what have you learned from the feedback you obtained via downloads, compiler inclusions, publications?
Sunita Chandrasekaran: I can chime in. It's a very good question; thank you for asking it. Having worked with several domain scientists belonging to many different domains, one thing we have learned, as part of my group here at UD, is profiling. Profiling helps immensely: things that you thought you had optimized and moved over beautifully...
this is with respect to OpenACC directives. So profiling and re-profiling, going back and fixing the optimizations. And there are tools like PCAST, which I was mentioning on the call yesterday, which allows you to look into the accuracy, or verification, of the ports between CPU and GPU. That has also helped us with, you know, important domain-science codes where accuracy matters. What else... did that answer your question?
Sunita Chandrasekaran: So, yeah, we do talk about this quite often within the committee, and I think I was trying to answer this on the chat channel: I believe directives are along for the ride. Eventually, and this is my personal opinion, I would think that things will move towards the base languages.
There are things directives cannot do, which is why programming models like CUDA, for example, or, you know, even the base languages, are doing their fair share. So there are things directives cannot do, and instead of trying to fix that, I think it would be ideal to get the best of all these different worlds and put them together, and that would probably be the base language going forward; probably directives won't be there ten years from now. Who knows.
Rahul Gayatri: Could you just repeat the question once more?
Dossay Oryspayev: Okay, so the general question was: what are the lessons learned from the history of the race of shared-memory programming models? Most probably, before advocating something, you have looked back into the history and have seen this race of shared-memory programming models; there were things introduced then, as is happening now. What are the lessons that you have learned, and what were the things driving you to advocate the specific programming models that you are currently trying to push forward?
Rahul Gayatri: So, I used OpenMP quite extensively before I started using Kokkos, and I agree with Sunita in saying that these directive-based programming models have a limit on what they allow you to expose. That's one of the reasons why I like Kokkos: it allows me to express the underlying parallelism in a much finer and better way, along with allowing me to use its
Views, you know, the multi-dimensional data storage classes, which provide layout options and things like that. With OpenMP you would actually do this kind of thing manually, and then you have to change it every time you go from CPUs to GPUs or vice versa, whereas a somewhat more involved framework gets you closer. I'm assuming, I've never worked with RAJA, but I'm assuming
it also has the same thing, where it allows you this more fine-grained control and exposing of parallelism, more than what a directive-based programming model can actually allow you. But there is a catch, right: it's not as easy to move to Kokkos or RAJA as it is with OpenMP or OpenACC, which is their strong suit.
It's easier to start programming on a GPU with OpenACC; as Nathan mentioned on a previous question, if it's a three-month project that you want to do, then OpenACC or OpenMP is probably the way to go. But you might not be able to expose everything, or get performance as optimal as you can with the more involved programming models. Does that answer your question?
Dossay Oryspayev: I'll just say thank you. So for the next one, David, let's have your answer; we're reaching the end of our time.
David Beckingsale: Yeah, sure. That was a super kind of in-depth question. Digging back into my memory bank: my first experience with this kind of stuff was really programming OpenCL. It's portable, that's great, but the one thing that really sticks in my mind is all the boilerplate. Then I kind of moved to CUDA, and yeah, you get less boilerplate, but you can only run your code on one vendor's hardware.
So really, I think it's clear that some kind of portable approach, whether or not that's, you know, some C++-based thing like Kokkos or RAJA, or a directive-based approach, is really going to benefit you in terms of the lifetime of your application. And for a more kind of practical recommendation: one thing that we've seen to be incredibly useful, with something like RAJA or Kokkos,
is that you can write your code and then, depending on the backend you select, you can target the GPU or the CPU. So what a lot of our code teams have done is, if they run into a bug on the GPU, they'll take exactly the same code but have it run with the OpenMP threaded backend on the CPU, and that lets you leverage all of the kind of debugging
and correctness tools that you have on the CPU to track down bugs; then you can go ahead and rebuild your code for the GPU. And oftentimes, you know, the tools you have access to on the CPU are better for addressing these kinds of issues. So I think, with any technology you choose, being able to run exactly the same code in multiple locations can really help in terms of debugging any problems that you run into.
Dossay Oryspayev: Okay, great. Thank you, David, Rahul and Sunita, one more time. I think this discussion went well; I mean, all of our audience were very active during these discussions.