From YouTube: Performance and Portability
Description
Christopher Daley (LBNL), John Owens (UC Davis) and Phil Roth (Oak Ridge National Lab) present a Panel Discussion on Performance and Portability. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/
Due to some data loss, this recording is missing the start of the initial discussion. Panel Chair: Muaaz Awan.
A
If you want a big multi-GPU box, they'd like you to throw down six figures and buy a DGX, but they've gone to a huge amount of effort to try to give you good scalability within that. So if you're interested in the highest performance generally, what you want to look at today are those sorts of boxes, and it seems like the supercomputers are moving in that direction too.
A
Multiple GPUs in the same box, but with a large investment toward trying to make them look like one GPU as much as they can. So, in terms of new hardware capabilities: there are a lot of things you can point to that have been introduced to processors over time along the lines of "here's this general thing we'd like to do in software, and we're adding hardware support for it," or "here's…"
A
Only in the last few years has Python really started stepping up in the GPU world. NVIDIA has their RAPIDS initiative toward doing data science and trying to do everything on the GPU, and there's a nice NumPy implementation that's been done by NVIDIA Research, published at Supercomputing last year, that gives good NumPy speedups on GPUs. So this is also a trend that I think is worth catching on to, because Python is a very nice environment to develop in.
A
And finally, the compilers: LLVM is this open-source compiler that's relatively straightforward to work with, and there are some new just-in-time kinds of technologies for being able to generate code at runtime, and they give you a lot of opportunity for doing some interesting innovations on the software side.
A
Three things I want you not to do as you're developing GPU codes. The first one is: do not think of the CPU as the main processor anymore. You want to think about this reverse-accelerator kind of model, because if your code is bouncing back and forth from CPU to GPU to CPU to GPU, you're having to move data a lot, and you really, really don't want to move data: that is expensive, and the bandwidth between CPU and GPU is rather modest.
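A minimal HIP sketch of that idea (hypothetical kernels, not from the talk; assumes a HIP toolchain): the two pipeline stages chain on the device, with one small copy back at the end instead of a host round trip between every step.

```cpp
#include <hip/hip_runtime.h>

// Two hypothetical pipeline stages; the point is that they chain on the device.
__global__ void step1(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = 2.0f * x[i];
}

__global__ void step2(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] + 1.0f;
}

int main() {
    const int n = 1 << 24;
    float* d_x;
    hipMalloc(&d_x, n * sizeof(float));
    hipMemset(d_x, 0, n * sizeof(float));

    dim3 block(256), grid((n + 255) / 256);
    step1<<<grid, block>>>(d_x, n);   // no host round trip between stages
    step2<<<grid, block>>>(d_x, n);

    float first = 0.0f;               // one small device-to-host copy at the end
    hipMemcpy(&first, d_x, sizeof(float), hipMemcpyDeviceToHost);
    hipFree(d_x);
    return first == 1.0f ? 0 : 1;
}
```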
A
And as much as you can, limit the amount of work per parallel grain. If you balance those things well, then you have what we call high occupancy: you're keeping the GPU busy all the time. It's necessary to keep the GPU busy, because it is absolutely critical that the GPU be able to hide long latencies from the memory system by having lots of work available to do. This is something where profilers can help, but experience helps as well.
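A minimal sketch of that balance (an assumed AXPY kernel, not from the talk): a grid-stride loop gives each thread a modest grain of work, while the launch still puts far more threads in flight than the GPU has lanes, so the scheduler always has warps ready to run while others wait on memory.

```cpp
#include <hip/hip_runtime.h>

// Grid-stride AXPY: each thread handles several elements, but the grid is
// large enough that stalled warps can always be swapped for ready ones.
__global__ void axpy(float a, const float* x, float* y, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 24;
    float *x, *y;
    hipMalloc(&x, n * sizeof(float));
    hipMalloc(&y, n * sizeof(float));
    hipMemset(x, 0, n * sizeof(float));
    hipMemset(y, 0, n * sizeof(float));
    axpy<<<1024, 256>>>(2.0f, x, y, n);   // ~262k resident threads in flight
    hipDeviceSynchronize();
    hipFree(x);
    hipFree(y);
    return 0;
}
```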
B
Thank you, that was very insightful. Next we have Phil from Oak Ridge National Lab. Can you please go ahead and share your screen?
C
Yes, I'm having to let Zoom get permissions to share.
C
All right, sorry about that; you can take that couple of minutes off my account. So, my name's Phil Roth. I have been a member of the Oak Ridge Leadership Computing Facility since late 2018. Before that, I was a member of the Future Technologies Group at Oak Ridge National Lab, where we focused on things like emerging technologies, and GPUs were one of the things our group worked on.
C
The biggest aspect of my current position involves preparing codes, or getting codes ready to run well, on Frontier, which, as you might expect, is our facility's next machine.
C
There's lots of attention being paid right now to trying to make something that will work well there, but we have a system on the floor right now, Summit. It's not the number one machine in the world anymore; I'm told it's number two as of last week. But it's still a very good development platform for Frontier, and, as you may know, the two systems use different vendors for their GPUs and have slightly different node architectures. So, about that node diagram that I have up on the right:
C
That
is
an
updated
frontier,
node
diagram.
Earlier
this
year,
cray
now
hpe
has
given
us
the
permissions
to
give
a
little
more
detail
about
what
the
node
organization
is
going
to
look
like.
So,
if
you're
interested
in
what
the
frontier
nodes
are
going
to
look
like,
I
believe
they
also
talked
a
little
bit
about
what
the
nvidia-based
shasta
nodes
are
going
to.
Look
like
too.
C
I'm also very interested in portability, as you might imagine, because of those couple of systems, and also because the point was made earlier in today's talks that most people in the user base of one DOE center are not necessarily limited to that DOE center; they're trying to run on systems at different centers, both within the DOE complex and outside. So one of my challenges in working with applications is to try to open people's minds to doing things in a way that's going to potentially port well to other systems. That involves looking at things that lots of people call "programming models."
C
I have a problem with that name; that's why I put it in quotes. But one of our primary ones at Oak Ridge to be looking at for Frontier is HIP, which Renee mentioned several times this morning.
C
The chart that you see to the right is a little bit of an eye chart, but what it shows is some normalized performance data from a benchmark suite, comparing HIP versions of the benchmarks running on Summit to the CUDA versions of the same benchmarks running on Summit. The HIP performance was measured to be 99.8 percent of the CUDA performance. But for both my own application development and for professional interest, I'm also interested in the SYCL and lambda-based models, where you specify your code in lambdas: Kokkos, RAJA, DPC++, and SYCL.
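For readers who haven't seen one, a minimal SYCL sketch of that style (illustrative only; assumes a SYCL 2020 implementation such as DPC++ or hipSYCL): the kernel body is an ordinary C++ lambda, and the runtime moves the data.

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    std::vector<float> v(1024, 1.0f);
    sycl::queue q;  // selects a default device (a GPU if one is available)
    {
        sycl::buffer<float> buf(v.data(), sycl::range<1>{v.size()});
        q.submit([&](sycl::handler& h) {
            sycl::accessor a(buf, h, sycl::read_write);
            // the kernel is just a lambda over an index space
            h.parallel_for(sycl::range<1>{v.size()},
                           [=](sycl::id<1> i) { a[i] *= 2.0f; });
        });
    }   // buffer goes out of scope: results are copied back into v
    return v[0] == 2.0f ? 0 : 1;
}
```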
C
So
I
loved
the
the
presentation
that
we
just
got
a
couple
of
the
recommendations
there
hit
really
close
to
home.
For
me,
what
I
want
to
bring
out
as
recommendations
is
is
to
think
about
your
code.
This
is
to
all
the
application
developers
think
about
your
code
kind
of
at
a
higher
level
and
a
little
bit
independent
of
the
programming
model
that
you
think
you're
going
to
use
to
actually
express
your
code.
C
If
you
like
to
think
in
python,
you
know
think
about
list
comprehensions.
If
you
are,
you
know
a
c
plus
plus
user.
Maybe
you
have
to
think
about
coco's
raja
if
you
have
the
experience,
but
you
might
have
to
get
acquainted
with
things
like
the
algorithms
library
and
then
choose
preferably
something
that's
portable
as
a
programming
model,
and
I
want
to
re
reiterate
the
point
that
was
made
kind
of
more
subtly
earlier
today
about
becoming
conversant
with
the
tools
that
the
systems
have.
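A small sketch of that "think in algorithms" advice (illustrative only): express the operation as a standard primitive first, and the same shape later maps onto Kokkos, RAJA, SYCL, or a parallel execution policy with minimal rewriting.

```cpp
#include <algorithm>
#include <vector>

int main() {
    std::vector<float> in(1000, 1.0f), out(1000);
    // A transform, not a hand-rolled loop: the algorithmic intent is
    // explicit, so the body can later become a device lambda in a
    // portable programming model.
    std::transform(in.begin(), in.end(), out.begin(),
                   [](float v) { return 2.0f * v + 1.0f; });
    return out[0] == 3.0f ? 0 : 1;
}
```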
C
So
all
the
doe
centers
have
training
all
the
gpu
vendors
have
have
training
materials
available
about
using
tools,
sometimes
the
vendors
themselves.
Their
tools
are
are
more
focused
on
a
particular
area
like,
for
example,
it
might
be
the
single
node
performance
and
then
you
have
to
go.
Look
for
a
a
separate
tool.
That's
going
to
give
you
a
scale
out
picture.
C
There's
some
examples
here
on
the
slide
and
my
experience,
especially
with
those
that
are
more
academic
scale
out
tools,
don't
be
shy
about
giving
them
feedback,
sometimes
they're
frustrated
that
they,
you
know
the
projects
are,
are
trying
to
do
the
best
they
can
in
terms
of
getting
information.
C
Sometimes
they
don't
have
all
the
information,
but
they
certainly
would
love
to
hear
about
people
who
are
trying
to
use
their
codes,
but
are
trying
to
use
their
tools
but
are
running
into
barriers
because
of
it,
and
that's
what
I
have
I
think
now
we
trans
transfer
into
the
question
and
answer
period
right.
B
Yeah, thank you, Phil. So now it's the start of the question and answer session, or the discussion session. The audience can ask questions in the Q&A box, or, if you want to ask a question verbally, you can raise your hand and we'll unmute your mic and allow you to talk. And if the panel members have questions for each other, that would also be appreciated.
C
Yeah, so what can I say about that? As Jeff Hammond talked about, there are some open-source implementations, and there are discussions going on; that's the best I think I can say in this venue about what might be possible. One of the things that I am interested in and play around with, as professional or personal research interest, with an eye towards what we could do, is trying to see whether the open-source implementations of these things can be made to run on systems that they weren't necessarily the primary target for. So, for example, Jeff Hammond mentioned hipSYCL; I think it was mentioned in this question.
C
I guess it was last year that I did some work to get hipSYCL running on, or at least to demonstrate hipSYCL running on, Summit, which was interesting to be able to do. That's not DPC++, so if you're reliant on the DPC++ extensions it's not going to give you those, but it gives you the basic SYCL-type operations.
B
Okay,
so
there
are
like
a
couple
of
questions.
Maybe
three
questions
in
the
chat
portion.
Can
all
the
panelists
have
a
look
at
those
starting
from
the
first?
That
seems
interesting.
A
Yeah, this is John. The programming models I know for GPUs are primarily data-parallel based, as opposed to task based, and so while I think there's interest in doing task programming, the programming systems don't support that nearly as well as they would if you did it in a data-parallel style. So I believe that in theory what you say is a good thing, and there are runtimes that are supporting it to some extent, but they are not supported by the vendors as much as the traditional way of doing things.
B
Okay, that's interesting. Does anyone else want to address the same question?
F
Yeah, sure, this is Chris. So this is supported to some extent by OpenMP runtimes: you have the map clause in order to perform the data transfers, and then you can also specify dependencies between OpenMP target regions, and these OpenMP target regions are effectively tasks.
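A minimal sketch of what Chris describes (assumes an OpenMP 4.5+ compiler with offload support): two target regions with map clauses, ordered through depend clauses so the runtime treats them as dependent tasks.

```cpp
#include <cstdio>

int main() {
    const int n = 1 << 20;
    static float a[1 << 20], b[1 << 20];

    // First target task: fill a on the device, then map it back.
    #pragma omp target teams distribute parallel for \
            map(tofrom: a[0:n]) depend(out: a) nowait
    for (int i = 0; i < n; ++i) a[i] = 1.0f;

    // Second target task: depends on a, so it runs after the first completes.
    #pragma omp target teams distribute parallel for \
            map(to: a[0:n]) map(from: b[0:n]) depend(in: a) nowait
    for (int i = 0; i < n; ++i) b[i] = 2.0f * a[i];

    #pragma omp taskwait   // wait for both deferred target tasks
    std::printf("b[0] = %f\n", b[0]);
    return 0;
}
```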
B
All right, Phil, do you want to say something about this?
C
Yeah, sorry, I was muted. I was just going to say that that's one of the arguments that the DPC++ and SYCL folks make for their programming model: the runtime is handling the data transfer; you're describing the dependencies, and the runtime is handling that. You can argue whether my experience is broad enough or not, but in my experience there are different classes of people's perceptions on that.
C
Some people would much rather be in control of all the data movement, and other people are happy to let a runtime try to automate that. And it's almost like a pendulum: it feels like it swings back and forth, where people think that a runtime can be smart enough, and then they see that maybe it can't, and then people go back to a more explicit data-movement type of approach.
B
All right, we have another question in the Q&A box: "What version of HIP was used for the HIP versus CUDA comparison? I have noticed much worse performance than with CUDA with version 3.5, especially with kernel launch overhead and impact on CPU performance." I think this is more targeted towards Phil.
C
Yeah, and I was typing up a response, which was part of the reason I was muted. One of the things that maybe didn't come out in the brief presentation is that both of those versions of the code were running on the same system, because of the way that HIP works. These benchmarks, apart from some very limited use of the HIP libraries, aren't using anything other than straight-up HIP.
C
The kernels were actually implemented in HIP or implemented in CUDA, as opposed to making calls to external libraries. Essentially, the way that works on a platform with NVIDIA GPUs, which is the way it's going to work on Perlmutter, is that there's a translation that goes on at compile time, and you end up compiling your HIP code with the nvcc compiler; at least, that's the way it was supported at the time that I did the comparison experiments. So by the time you've actually produced an executable, as far as the system and the runtime know, it's a CUDA executable. So I didn't see the problem that Charles is bringing up.
B
Okay, does anyone else on the panel have a response to that, or should we move on?
G
Oh, I actually don't have a question, I just have a comment. I think even in CPU land nowadays we have SIMD, which is a kind of low-level data-parallel paradigm, and then at a high level we have the cores. Naturally, this is quite similar to a GPU, where you have one GPU or multiple GPUs, and, on the other hand, you have a fat low-level vector. So when we program either kind of system, we have to think about how to utilize all the low-level data parallelism, and then, at the high level, we should also consider task-parallel types of things. The reason is that as these devices get fatter, they start to get isolated, and the interconnect…
G
You do have to consider both in your program. If your program can first leverage the data parallelism, if your loop has a million iterations, do it, and that will probably work pretty well for GPUs for a while. But as vendors scale up the chips, they get even more powerful, more hungry, and you want to strong-scale your problem to reduce the time to solution, to the point that a single kernel cannot feed the device. Then you do have to take advantage of asynchronously launching multiple kernels to get them to occupy the device.
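A minimal HIP sketch of that last point (hypothetical kernel, not from the talk): small kernels launched into separate streams may overlap, so together they can fill a device that no one of them could occupy alone.

```cpp
#include <hip/hip_runtime.h>

__global__ void work(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 16;          // deliberately small per-kernel work
    const int nstreams = 4;
    hipStream_t s[nstreams];
    float* d[nstreams];
    for (int k = 0; k < nstreams; ++k) {
        hipStreamCreate(&s[k]);
        hipMalloc(&d[k], n * sizeof(float));
        hipMemsetAsync(d[k], 0, n * sizeof(float), s[k]);
        // kernels in different streams are independent and may run concurrently
        work<<<(n + 255) / 256, 256, 0, s[k]>>>(d[k], n);
    }
    hipDeviceSynchronize();
    for (int k = 0; k < nstreams; ++k) {
        hipFree(d[k]);
        hipStreamDestroy(s[k]);
    }
    return 0;
}
```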
B
Yeah, thank you for the comment. I think it's very interesting.
B
All
right,
we
have
a
few
more
questions
in
a
q,
a
box.
So
the
first
question
is:
how
does
number
compare
to
numpy
on
gpu?
I'm
guessing
john
owens
pointed
out
some
python
accelerated
libraries.
Do
you
have
a
comment
on
this.
A
Right, so I can point you to two things that I know, and I hope other people might comment and say, "oh, look at this," otherwise. The NumPy stuff I know is built on top of Legion; it's Legate ("leg-ate"), and it was published at Supercomputing last year. I'm not aware of it being a product, but the results they got were outstanding; it's really neat work. For the Numba stuff, I would look at RAPIDS, at rapids.ai. I would say the Python development that's been happening at NVIDIA has been focused within RAPIDS, and it's more on the data science side, but they've done a lot of Python work to be able to make GPU computing map onto that. Numba specifically is certainly mentioned on their web pages.
B
Thank
you,
and
next
we
have,
I
think,
it's
more
of
a
comment.
Handling
data
transfers
by
hand
also
means
handling
the
gpu
memory
by
hand.
This
can
become
hearing.
C
Yeah, so I'm guessing this was in response to my comment, and again, I really feel like people fall into different categories. You can argue that a person shouldn't need to pay attention to that level of detail, but I can point to examples of scientists where you can tell them, hey, let the runtime, let the compiler, let the whatever handle this, and they don't want to.
C
They don't trust that the toolchain can actually handle it. At some point you can sit there and try to show them, hey, the performance is close enough, but some people are dead set on not trusting that those things can stay abreast of the architecture changes or the underlying software changes. They've got access to the low-level pieces, and they want to manage that complexity themselves.
B
Okay, that makes sense. Next we have an interesting question from Kate Clark: "What do the panel think of C++17 parallelism and Fortran 2018 do concurrent, and evolutions of those going forward (e.g. Kokkos feeding into C++23, etc.)? Is first-class language evolution going to replace directives like OpenMP and OpenACC, or language extensions like CUDA and SYCL, perhaps saving extensions for specific optimizations as needed?" I believe the panelists can see the question.
A
It could, and I'm looking forward to your talk; I think it absolutely could replace those. But I think the question is how quickly that happens, right? One of the reasons that CUDA has gained traction is that NVIDIA has been able to make changes to the language fairly quickly and bring in new hardware features, and to do that in a standards…
C
I
guess,
as
I
took
it
from
what
I
read
and
what
I
remember
was
that
directives
were
essentially
a
stop
gap
in
between
getting
something
eventually
accepted
into
the
the
core
language
of
whatever
it
was
that
whatever
the
programming
language
was,
so
I'm
actually,
I
you
know
what
I've
seen
of
the
c
plus
plus
17
parallelism,
I
think
I'll
admit
I
think
more
in
in
c
plus,
plus
than
I
do
any
other
language,
but
so
what
I've
seen
there
has
been
really
exciting.
C
As someone who's a fan of things like Kokkos or SYCL or DPC++, I think we need to see evidence that the implementations are going to do well, but I trust that eventually they will; I'm not worried about that.
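For concreteness, a minimal sketch of the C++17 parallelism under discussion (standard algorithms with execution policies; some compilers, for example NVIDIA's nvc++, can offload these to GPUs):

```cpp
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> x(1 << 20);
    std::iota(x.begin(), x.end(), 0.0);

    // The execution policy asks the implementation to parallelize (and
    // possibly vectorize) the algorithm; no directives or extensions needed.
    std::transform(std::execution::par_unseq, x.begin(), x.end(), x.begin(),
                   [](double v) { return v * v; });
    double sum = std::reduce(std::execution::par_unseq, x.begin(), x.end());
    return sum > 0.0 ? 0 : 1;
}
```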
F
Yeah, these base-language additions are important, but just looking at Fortran, for example, the fact that it only provides a do concurrent abstraction is really not enough for users to implement all of the parallel algorithms they want to implement. It will take time before the base language can support all of the parallel constructs that we need, and, as was mentioned earlier, in order to use the latest and greatest low-level hardware features, these proprietary languages or directives are the…
B
Okay, so we have Laurie Stephey, who would like to respond to one of the questions. I'm going to enable your mic; Laurie, you're good to go.
D
Okay, yeah, can you hear me? Yep? Okay, I just saw one question in the chat that the speaker already addressed, about NumPy versus Numba. Yeah, the NumPy that most Python people are familiar with is not going to work out of the box on a GPU, so you'll need something specific, and that might be CuPy, which looks like NumPy but is CUDA under the hood, or Legate, which is what the speaker mentioned, which is Legion under the hood. I linked those in the Q&A, but yeah.
D
Yes, I have. I've used Numba quite a bit and I like it, but for most Python people it might be a little bit of a shock, because it looks more like CUDA than Python. And actually, we're just now getting a collaboration going with the Legate developers, so we're going to test that out on Cori GPU.
D
So I can report about that soon. And yeah, CuPy is the easiest by far, so it's probably the best thing to get started with. I'm happy to talk more about this.
B
Okay, great. And I think Jeff Hammond also has a few responses to one of the questions. It would be great if Jeff could answer verbally. I've just unmuted your mic; do you want to?
E
Oh, okay. Sorry, I was just saying that for folks who are interested in USM and any other features, my email is easy to find; use either Gmail or Intel (Intel's preferred, but Gmail is easier to find). Just let me know. I'd be really interested to know what features folks want to see, and I can introduce you to the implementers if that would be helpful.
C
All right, I've got a panelist-to-panelists question, well, two panelist-to-panelists questions. The first one is for Owens: you made the comment about data structures today, and I love the comment. I can't tell you how frustrated I am with one of the codes that I've been working with, where we, and by "we" I mean me, because I was involved in the initial construction of this code…
A
Sometimes you want to do things heterogeneously: you want to make the common case fast and try to make that fully coalesced, very good parallel access, even if it means you're doing things on the periphery differently. That is often a win, to make things heterogeneous, and then you have different execution paths, maybe different kernels, to handle those things. But I think we have to start at the beginning when we design something and say, okay, how are we going to lay out the data?
A
How are we going to access the data? What are those patterns? And then design the computation around them, and also understand how we're passing data from kernel to kernel. Now, more and more, the GPU memory hierarchy is allowing better and better caching (the caches on Ampere are much larger than the ones on previous generations), and so, while historically it hasn't been very effective to do caching to do latency reduction, more and more, designing things that use the cache well is going to be a good direction to go. The other thing is that the user-managed cache, the "shared memory" as NVIDIA calls it, that little scratchpad of a few dozen kilobytes, is really crucial: if you can keep your computation within a single kernel, then that is a huge win in terms of performance, and in terms of more sophisticated data structures.
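A minimal HIP sketch of that scratchpad idea (hypothetical three-point blur, not from the talk; assumes 256-thread blocks): each block stages a tile plus halo cells in shared memory, so neighboring elements are fetched from DRAM once and reused.

```cpp
#include <hip/hip_runtime.h>

// Three-point blur that stages a tile of the input in the per-block
// scratchpad ("shared memory"), so each element is fetched from DRAM once.
__global__ void blur(const float* in, float* out, int n) {
    __shared__ float tile[258];          // 256 elements + 2 halo cells
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;
    tile[t] = (i < n) ? in[i] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (i > 0) ? in[i - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[257] = (i + 1 < n) ? in[i + 1] : 0.0f;
    __syncthreads();                      // tile is now complete for the block
    if (i < n)
        out[i] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    hipMalloc(&in, n * sizeof(float));
    hipMalloc(&out, n * sizeof(float));
    hipMemset(in, 0, n * sizeof(float));
    blur<<<(n + 255) / 256, 256>>>(in, out, n);   // 256-thread blocks
    hipDeviceSynchronize();
    hipFree(in);
    hipFree(out);
    return 0;
}
```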
A
There's this real chicken-and-egg kind of problem, where people don't think in terms of using more sophisticated data structures, so they don't write programs that use them, so nobody has any use cases for which to design more sophisticated data structures. But I think the future is that we will have a larger toolbox of data structures, and I certainly am interested in hearing what those data structures should be from people who might be used to using them on other platforms.
C
Yeah, just to also emphasize that point that was made about thinking of the CPU as being the offload engine and the GPU being the core processor: one of the statistics we like to throw around is that on Summit, well over 90-some percent of the capability comes from the GPUs, and on Frontier it's going to be over 95 percent of the capability of the system.
A
It is less expensive to do a breadth-first search on a large graph than it is to send that graph from CPU to GPU, which is kind of mind-boggling, right? A BFS is not the most complicated thing, but you've got to touch everything in the graph, and it's still faster than sending the graph from CPU to GPU.
B
So there has been a lot of talk about performance and portability, and for some time I've also seen productivity mentioned alongside those two Ps. How do the panelists think productivity will be factored in in the near future? We have models like the roofline model for quantifying performance on different devices, and portability to some extent we can measure, but how do you factor in productivity?
C
Well, all right, I'll jump in there, and I hate that I'm saying something about back in the old days, but DARPA had a program called PERCS (P-E-R-C-S), and one of the things they tried to do was come up with a definition of productivity. My recollection was that there were some interesting attempts, but nobody actually felt like they did a good job of defining productivity.
C
So,
yes,
I
know
about
the
p3hpc
efforts
because
I'm
involved
in
them
on
the
organizing
side,
and
but
I
I
struggle
a
little
bit
even
the
portability
side
right.
So
we
have
a
couple
of
people
that
have
have
come
out
with
metrics,
like
john
pennycook
from
intel
has
defined
a
metric
for
what
performance
portability
might
mean,
simon
mcintosh
smith,
who
had
some
exposure
in
one
of
the
earlier
presentations
from
his
work.
They
they
have.
C
C
The ideal thing would be to have models that would show us whether it's worth doing that porting effort you just talked about, but I don't think those models exist, and right now I haven't seen anything that I think is so promising that we're going to get there soon.
A
I mean, it's a near-impossible problem to solve, to get both performance and portability. If I were to look back at the OpenCL effort, I felt like there was a lot of emphasis on correctness portability, in that you could run something on another piece of hardware and it would run correctly, but there was just not really an emphasis from the committee on how to build a programming system that allows high performance across the whole range. Maybe that's impossible, but it just wasn't a focus, and there's always a tension between portability and performance. I will be looking back in another panel in ten years to see whether the current efforts have struck the right balance there or not.
F
I would just say, for productivity, I think one measure is where you can develop the code. If you consider something like directives, you could develop your application with directives on any platform, even on a machine without a GPU, so being able to develop the code anywhere is obviously a productivity win. Also, with things like directive-based programming, you have a longer-term kind of maintainability win, where the compiler can generate code that's appropriate to the target hardware at that time. It's less down to the programmer to tune their code for a specific piece of hardware; with the directives you can defer control to the compiler and allow the compiler to better map the code to the hardware at that time.
C
I'm
gonna
argue
that
those
nice
features
that
you
mentioned
are
not
limited
to
directives,
they're,
not
specific
to
directives,
because
I
think
about
you
know
something
like
coco
raja.
That's
got
different
back
end,
implementations
and,
and
those
features
you
just
talked
about.
You
know
you
could
say
that
if
you
write
to
that
coco,
raja
interface
and
and
sickle
and
dpc
plus
plus
fall
in
the
same
category
right
there,
they're
acting
as
an
abstraction
layer
and
the
back
end
implementation
is
the
thing
that's
going
to
determine
your
actual
performance.
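For readers unfamiliar with that style, a minimal Kokkos sketch (assumes the Kokkos library; the back end, whether CUDA, HIP, OpenMP, or serial, is chosen when Kokkos is built, not in this source):

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int n = 1 << 20;
        // A View lives in the default memory space of the chosen back end.
        Kokkos::View<float*> x("x", n);
        // The same parallel_for source runs on whichever back end was built.
        Kokkos::parallel_for("fill", n, KOKKOS_LAMBDA(const int i) {
            x(i) = 2.0f * static_cast<float>(i);
        });
        Kokkos::fence();
    }
    Kokkos::finalize();
    return 0;
}
```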
F
Compared to directives, I was thinking more in terms of proprietary languages. I know, for example, in my early days of trying to use CUDA, it was hard just to get access to a machine that had NVIDIA GPUs for me to experiment with CUDA. There you go.
A
That comes up every year at the NVIDIA data science summit: everybody says, "I want a laptop that has CUDA on it," and everybody has Macs in the room, and so it's frustrating. I agree that, productivity-wise, it's really nice to be able to work on your laptop.
C
Well, that's part of the reason why, again, thinking back to that one code I was talking about where we're struggling with the data structure, one of the things we have adopted there is that we're using Kokkos for that code. So we can be doing development on laptops all the way up to the big HPC machines, with different performance, while recognizing that there are problems you see when you're targeting actual GPU hardware that don't necessarily show up if you're using OpenMP on a CPU as the back end.
A
One of the enormous strengths that I've had from being a CUDA developer, and my group has had, is that there are just so many other CUDA developers, and so we're often able to leverage work other people have done. As an academic, Kokkos is not something that's had, in my view, an enormous impact outside the DOE.
A
I like its focus and what it's trying to do, but how do you enlarge that footprint? Because if you had a big presence outside the DOE, it would only help people inside the DOE.
B
Thank you very much for your points of view. I think the discussion is still ongoing; maybe we can continue it during the break in the breakout room. I would like to thank once again Chris, John, and Phil for participating very actively, with all their energy, in this session, and all the audience who participated.