From YouTube: Roofline Hackathon 2020 part 3
Description
Demo of Nsight Compute using toy kernels and real HPC codes

This effort of doing roofline analysis in Nsight Compute, which I'll talk to you about for the next hour and a half or so, is largely the result of a co-design effort between DOE and NVIDIA — specifically a request from LBL, and in particular from Sam and Charlene, who are helping run the session today and who pushed for us to add this functionality to Nsight Compute. So I'm happy to report that, as of the CUDA 11 release of Nsight Compute, which is version 2020.1, we are now able to do roofline analysis in Nsight Compute.

Before I talk about that, I want to give you a brief overview of NVIDIA's developer tools for profiling, to give you a sense of the landscape: what tools we have now and how roofline analysis might fit into that.

The set of profiling tools that NVIDIA provides goes under the Nsight product family name. In particular, we're going to focus on Nsight Systems and Nsight Compute. I'll emphasize that these are not the only developer tools that NVIDIA provides. There are also, for example, debugging tools like cuda-gdb, which is a CUDA extension of GDB and can be used for debugging applications that run on NVIDIA GPUs, as well as cuda-memcheck and the new Compute Sanitizer, which are roughly analogous to what you might use Valgrind for on a CPU application. We also work closely with the third-party tools ecosystem, so tools like HPCToolkit, TotalView, that sort of thing — Vampir and Score-P — know how to talk to NVIDIA GPUs and are supported on our platform.
The Nsight product family looks like this. Typically you would start with Nsight Systems to get a comprehensive, application-level view of what happened when you ran your code. It's collecting information on both the CPU and the GPU, and it's really telling you things about when and where you had GPU workload on your system and when and where you had CPU workload on your system.

It tells you when your kernels are running — kernels are just the name for the discrete units of work that happen on the GPU, regardless of which programming language you use — and Nsight Systems helps identify where those are. Generally speaking, you use it to get a high-level view of the performance of your application and understand: am I using the GPU effectively at all? You can only be using the GPU effectively if the bulk of your runtime, in some sense, is spent on the GPU.

So if you have a GPU-accelerated application and only five percent of the runtime is happening on the GPU, this suggests that you're probably not using a GPU compute node effectively. Nsight Systems should really be used to answer that question first — namely, what percentage of the time am I actually spending on the GPU — and you really want to maximize that, or at least get it pretty substantial, so that you know you're using your GPU effectively.

Only then, when you have determined that a particular kernel or set of kernels is dominating the runtime of your application, should you attempt to optimize those kernels. This talk is mostly going to focus on that process of diving into a particular kernel and analyzing its performance, but I just wanted to start by emphasizing that in some sense this is not the first step of the process.
Typically, in most cases your workload will be more complex and it's not just a single kernel that dominates the runtime; you want to get to that place, but you may not start there. So in this workflow you typically start with Nsight Systems, identify your particular kernel or set of kernels, and then analyze those kernels with Nsight Compute.

If you are currently a user of nvprof and its user interface, the NVIDIA Visual Profiler (NVVP), we're generally encouraging you to switch to these new tools, Nsight Systems and Nsight Compute. nvprof and NVVP are in maintenance mode, so we are fixing bugs as we find them, but we are not adding new features. All new profiling development is going into the new Nsight Systems and Nsight Compute tools, and in particular these will be the only way to profile on Perlmutter, so it's definitely worth your time to learn how to use them.

You just do nsys profile and then the name of your application, and if you add --stats=true, that gives you a summary output to standard out which lists the kernels that ran as well as the other operations that occurred. That's pretty similar to what you would have gotten if you used the nvprof command line with no arguments. I'm going to use just the command line interface today, because the sample codes we're going to be working with are very simple and only have one or a few kernels, so jumping into the UI won't tell us much more than we can see from the command line output — but for a real production science workload the UI timeline becomes much more informative.
In the timeline you see the workload on the CPU threads, as well as calls into the CUDA runtime API. The CUDA runtime API is typically what gets called into regardless of your programming model — whether you use OpenMP offload or OpenACC, or a higher-level approach like Kokkos or RAJA or Thrust, they're typically calling into the CUDA runtime API to actually launch work on the GPU. In the bottom half of the plot you see information about the kernels that ran on the GPU, as well as memory operations. The kernels listed here are in blue and the memory operations are listed in red.

Okay, so I'm going to jump right into Nsight Compute, which is our kernel profiling tool, and tell you a little bit about how it works. Nsight Compute is designed to give you different views into different aspects of the performance of your application, and it's presented in the form of several sections, each of which tells you something about the performance of your application — in particular, a particular kernel from that application.

The first section, which is the one you typically start out with, is the GPU Speed of Light section, which tells you what percentage of peak you're getting for both compute (the upper bar) and memory bandwidth (the lower bar). I'll go through what these mean in more detail in the hands-on exercises. But then we have several sections that follow.
One of the sections will be the roofline analysis section that I will show you during my demo, but we also have other sections like Compute Workload Analysis, Memory Workload Analysis, that sort of thing. Again, I walk through this UI in detail in my walkthrough, so I'm not expecting you to understand all of this now; I just want to give you a sense of what you're getting.

Nsight Compute has both a GUI and a command line interface, and it's pretty customizable. In fact, one of the things we'll see today is that you can customize it pretty heavily to do the analysis that you want to do. One example that I will talk about is that you can actually create your own roofline chart. So if you wanted to add some roofline analysis that we don't provide for you, it's actually fairly straightforward to do; the main challenge is understanding what hardware counters you would need in order to provide the information you're looking for.

Nsight Compute has a Memory Workload Analysis section that allows you to see the flow of memory traffic through both the physical memory spaces — L1 cache, L2 cache, and device memory — as well as logical memory spaces like global and local memory. It also has other things like compute and instruction workload analysis, which we'll take a look at. Nsight Compute also has the capability to create what's called a baseline, to compare multiple versions of a kernel.
For example, you profile a kernel and then you make some tweak to it, and you want to see: did my tweak make the performance better? You would load that report in, create a baseline from the original version, and then get two bars so you can see both the current run and the baseline, and that tells you whether your performance got better or worse. You can also, of course, do that with multiple invocations of the same kernel in the same application, in case you want to check whether the performance of that kernel varies as a function of time in your application.

Nsight Compute also allows you to do correlation between the assembly instructions and your lines of source. The way Nsight Compute typically works under the hood is that it's collecting samples of hardware counters at each instruction in your assembly code, gathering information about the different things happening at that particular assembly instruction — how much time was spent there, how many floating point operations, how many memory operations are occurring at each instruction — and you can, if you want, correlate that back to the source code that you actually wrote, whether it be in C or in Fortran or some other language.

One thing that I'll emphasize, though, is that it can be pretty tricky to use this correctly. The fact that a particular line of code has the most samples associated with it doesn't necessarily mean, naively, that it's the most expensive line in your application. Understanding that really entails getting a more thorough understanding of the fact that GPUs are running many instructions simultaneously, so one must interpret this with caution, and in practice it requires some experience to interpret it.
Today I'm only going to use the command line interface to drive the application, and then, when I want to load the results into the user interface, I can just save them to a file that Nsight Compute knows how to interrogate and display the results from. But you can also just print some results to standard out if you want to, and this is an example of what that might look like. It has many of the same fields — for example, the Speed of Light metrics, which give you a percent of peak, are the same numbers you would see in the bar charts that I showed before.

If you want to profile a kernel with Nsight Compute, you just use the command line interface, named ncu — that's the command-line executable that does the profiling. The name was a little bit different in previous versions of Nsight Compute: it was nv-nsight-cu-cli, which is both a mouthful and hard to type. So we've shortened it and hopefully made it a little nicer for you, and this is available as of the most recent release of Nsight Compute, 2020.1, which is available on Cori. You just do ncu and the name of your application. If you do that with no arguments, it profiles every kernel in your application — and because of the nature of GPUs, we cannot collect an arbitrary number of hardware counters at every invocation of a kernel.
So if you want to get a relatively detailed view of a kernel, Nsight Compute needs to re-run your kernel multiple times in order to understand its performance by collecting all the counters you asked for. Implicitly, that can make your application take a very long time — potentially orders of magnitude longer than when it's not being profiled.

So in production applications it's typically recommended to narrow down your search a little bit, either by specifying the particular kernel name that you're looking for (that's with the -k option) or by profiling only a certain subset of the invocations — for example, only profiling one or a few invocations of the kernel and leaving the rest unprofiled. There's another command line option to do that as well, which I could talk about.

But, as I said, you can also use the UI for driving the application. I won't do that today, mostly because in a typical HPC cluster environment it usually makes more sense to drive the application with the command line interface, save the file, and then view it offline in the user interface. But if you were developing on your local workstation, you could use the UI for that.

Before the exercises, there's a question in the Slack: do we need the Nsight that comes with CUDA 11?
Basically, the answer is yes. Let me put it this way: the version of Nsight Compute that you use to collect the data should be consistent with the version of Nsight Compute that you use to view the data. If you have a version mismatch, it's possible that it will still work — in particular, it usually works for a newer version of the UI to load an older version of the report, but it often does not work the reverse way.

I will also note, in response to the specific question, that you do not need to install the NVIDIA driver, or even the CUDA toolkit as a whole, to install Nsight Compute. Nsight Compute is available as a standalone installer. If you just Google for Nsight Compute — and I can give an example of the installer page here — you can see that if you click this download button, it'll take you to a download portal in the NVIDIA Developer Zone, which allows you to download Nsight Compute as a standalone tool rather than having to download the entire toolkit, and that will be a new enough version to support the analysis that we're doing today. I'll also point out that this requires an individual developer login, so you need to be willing to create an account to go this route, but hopefully it won't take that long.
Okay, so I'm going to jump right into some examples now. I'm going to encourage you to actually walk through these with me. If you want, you can just watch, but I think you will gain a lot more from this exercise if you attempt to do it yourself, and for that reason I will go through this relatively slowly so you have the opportunity to follow along if you want to.

The first thing you should do is clone the roofline-on-nvidia-gpus repository. This is the repository that Charlene was showing earlier; it's on GitLab, and I will just copy it into the Zoom chat — and if somebody could copy it into the Slack chat as well, that'd be appreciated. You'll want to go ahead and git clone this repo, so I'll give an example: I can just do git clone and then that git repo URL, and I'm going to recommend that you clone it at NERSC, because NERSC is where we're actually going to be collecting the data. So I'm going to go ahead and do what I just recommended and git clone this repository.

This will take a little bit of time to clone, because the example code we're going to use has a relatively large input file. We're going to fix that in the future, but for now it's kind of a large download, so I apologize for that. You'll see it takes a few seconds, and you can see I'm doing this actually at NERSC.
Note that this CUDA module is not the default CUDA module — it's one version newer than the default, which is 10.2.89 — so go ahead and do module load for the CUDA 11 module (cuda/11.0.167) explicitly so you get that. There is also an nsight-compute module, which is decoupled from the CUDA toolkit. I'm not going to go through that today, but that module always has the latest version of Nsight Compute; it just happens to be that right now these two are the same thing. Nsight Compute does release more frequently than the CUDA toolkit, so if you want the latest and greatest you can always load that module explicitly.

In the repo there's the actual application source code that we're going to look at today; there's also an ancillary file which loads the input data and sets up all the arrays that we're going to work with; and there's the actual input data file, which is that large thing I was talking about. There's also a README which goes through a description of what's actually in here. I'm going to walk through this with you, so you don't have to read all of it now, but if you want to refer to it offline, you can just look at this relatively detailed README to understand what's going on with the files we provided here — in particular the scripts that are being used for collecting profiling data. There's also a Makefile which is used for compiling the code I'm going to work with today. If I inspect the Makefile, you'll see that I'm using PGI to compile OpenACC code — this is Fortran code and we're using OpenACC as the parallelism model for the main exercise.
One of the nice things about using Nsight Systems and Nsight Compute, which I'll go through in a little bit, is that they're pretty agnostic to the programming model: anything that is capable of generating NVIDIA CUDA code under the hood — which is basically what this is doing — can be used with Nsight Compute, and so that's totally sufficient for what we're going to do today. I guess the corollary is that none of the principles I want to talk about are specific to OpenACC, and I'm mostly going to ignore OpenACC as a language when I look at Nsight Compute. But I am going to focus on OpenACC again when I talk about the optimization of this particular kernel, just because we do need to think a little bit about the parallelism in order to think effectively about how to use GPUs.

Okay, so what I'm going to do is cd into the tutorial directory for a second — this is the set of files that we're really going to work with today — and if I ls it, you can see it's taking a long time. I think that's either because the Cori file system is being a mess, or because my script that does some things under the hood as part of my bash profile is taking a long time; I'm not sure which it is.

What I've done is I've created a README that describes the tutorial we're going to work with today, and I will explain what these files are, but we can also just look at the README in our browser. So if you go to this tutorial directory and then look at the README, it talks about what we're going to do today. I'm going to actually go through this live, but if you get lost, or you want to refer to it later, the README helps describe, in words,
the exercises I'm going to go through today.

The first thing we're going to do is look at a very simple tutorial code. Before we get to the more complex GPP science example that we prepared for today, I'm going to go through some very simple CUDA C kernels, which both give us some practice actually using Nsight Compute to collect data and help us understand whether the roofline analysis and the other parts of the profile that we collect jive with our intuition about how these individual kernels should work.

So I'm going to go ahead and open up this file in my text editor, and what I'm going to look at is three kernels that are part of this simple CUDA C application.
Let's look at what kernel A does — just this main part of the work here, which is the thing we're going to focus on. Kernel A takes a simple 1D array that we're calling a — an array of doubles — and creates a simple local variable d, which is the result of adding a particular element of a to it 100 times... well, rather, we're unrolling it a hundred times, and the number of times we add is determined by this parameter M, which in this particular application we're passing in as 10,000. So we're going to add the same number 10,000 times into this local variable d, and then store the result back into that same array index. This is obviously an extremely contrived example.
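For reference, kernel A has roughly this shape — a sketch reconstructed from the description above, not the repo's exact source:

```cuda
// Illustrative sketch of kernel A (names and launch details are assumptions).
__global__ void kernel_a(double* a, int m)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    double d = 0.0;
    #pragma unroll 100            // the loop body is unrolled 100 times
    for (int j = 0; j < m; ++j)   // m is passed in as 10,000 here
        d += a[i];                // a[i] stays in a register: ~one 8-byte load

    a[i] = d;                     // one 8-byte store
}
```

Each thread therefore does on the order of 10,000 double-precision adds against only a handful of bytes of DRAM traffic, which is why we expect it to land in the compute-bound regime.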
A
You
would
never
do
anything
like
this
in
a
real
science
application,
but
the
reason
I've
chosen
this
example
is
that
it
there's
a
lot
of
compute
work
here
right
and
so,
if
we
think
about
the
plot
that
that
sam
showed
earlier,
where
you're
in
both
the
bandwidth
band
regime
and
the
compute
band
regime,
this
should
probably
be
in
the
computer
and
regime
because
we're
doing
a
lot
of
floating
point
operations
for
a
relatively
small
amount
of
work.
So
I'm
going
to
pose
a
question
if
m
is
equal
to
10
000?
A
A
A
Okay,
yeah,
that's
that's
pretty
much
right.
So,
basically,
what
we're
doing
here
is
we
are
doing
10,
000,
double
precision,
floating
point
operations
and
approximately
speaking
we're
just
doing
a
single
load
and
a
single
store
right
now.
How
many
bytes
are
there
in
a
double
precision?
Word:
there's
eight!
That's
exactly
right,
so
you
could
estimate
the
arithmetic
intensity
of
this
kernel
approximately
as
ten
thousand
a
rate
now.
A
Somebody
else
in
the
chat
pointed
out
that
this
may
be
affected
by
the
cache
line
size
right
because
in
reality
we
are
not
just
learning
a
single
eight
byte
word
of
a
we're,
in
fact,
loading
a
full
cache
line.
Typically
now
the
reason
that
that
is
not
so
relevant
for
this
is
really
depending
on
the
way
the
innovative
gpus
work,
because
each
thread
in
our
cuda
kernel
is
is
accessing
a
different
location
in
a
so.
A
It's
also
true
that
multiple
threads
are
exiting
that
same
cache
line
at
the
same
time,
and
so
from
a
from
the
perspective
of
how
we
typically
would
analyze
this
piece
of
code.
Typically,
we
would.
We
would
use
that
definition
of
arithmetic
intensity
of
ten
thousand
over
eight
or
twelve
fifty.
But
what
we'll
see
is
that
in
fact
that
could
be
affected
by
the
cash,
but
it
turns
out.
A
Actually
the
arithmetic
is
in
fact
1250,
but
it's
it's
good
that
you're
already
kind
of
understanding
that
the
cash
effects
could
affect
that,
but
typically
what
we
do
in
roof
light
analysis.
Is
we
separate
that
out
right?
So
we
just
focus
on
the
number
of
bytes
moved
from
dram
in
order
to
do
this,
and
because
it
happens
to
be
the
case
that,
for
this
application
we
are
going
to
be
loading
every
element
of
a
and
coalesce
loads.
Kernel B is actually identical to kernel A in code — I'll explain the difference between kernel B and kernel A in a second — and then kernel C is a little bit different. Kernel C has a strided memory access: a at some location, determined by our unique thread index in the grid, is set equal to b at some other, strided index, plus b. So this is just a single double-precision add, and the strided index is given by this formula. Rather than try to parse what this math does: it means that the threads in warp zero access memory locations that are 32 bytes apart; warp one then accesses the next locations, so thread zero in warp one accesses a[1], thread one accesses a[stride + 1], and so on. The end result is that every location in a does get accessed exactly once, and the same thing is true for b, but for any particular thread the element it stores into a comes from a different offset in b, and any particular warp is accessing disjoint locations in memory. This is pretty much one of the worst access patterns you can have from the perspective of coalesced loads within a warp. The question was whether I'm assuming 32 threads per block — I'm actually using 64 threads per block in this example; I just used warps here for simplicity.
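Again as a sketch rather than the repo's exact code — the real index arithmetic is a bit more involved — kernel C looks something like this:

```cuda
// Illustrative sketch of kernel C's strided access (the exact index formula in
// the repo differs; 'stride' and this expression are assumptions).
__global__ void kernel_c(double* a, const double* b, int stride, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Neighboring threads in a warp end up 'stride' apart in b, so the loads
    // are not coalesced, even though every element of b is touched exactly once.
    int j = (i * stride) % n;

    a[i] = b[j] + b[j];   // a single double-precision add per element
}
```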
So I mentioned that from the perspective of coalesced loads from DRAM this is actually pretty bad. You could attempt to compute in your head — and, if you want, take a guess in the chat — what the performance of this will be: what kind of DRAM bandwidth we'll get from this kernel. Go ahead and think about that if you want to.

Now, for these three kernels we are creating arrays of length 80 × 2048 × 100. I've chosen this number because 80 × 2048 is the total number of threads that can be simultaneously resident on a single V100 GPU, and V100 is the GPU we're going to use today on the Cori GPU nodes. I have scaled that by 100 just so there's sufficient work to do, and again these are double-precision numbers. You can see I've just created, with CUDA, arrays a and b of that length.
I've just set them to zero, I'm defining my threads per block, and I'm launching kernel A, kernel B, and then kernel C. I mentioned that in code kernel A and kernel B are identical, but the difference is in how I'm launching them: with kernel B, I'm setting the dynamic shared memory for each thread block to be equal to 96 kilobytes. This happens to be the maximum amount of shared memory that you can request on a Volta V100 GPU, as long as you set the corresponding attribute for the function appropriately. Essentially what this does is ensure that only one thread block can be simultaneously resident on a particular SM, rather than the maximum, which is 32. That will have effects on the occupancy of our GPU, because GPUs work by hiding latency, by having many threads simultaneously resident at once — so when one thread issues a memory load, we can shuffle it off to the side and let another thread come in and do some work.
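The launch difference between kernel A and kernel B is roughly the following — a sketch assuming the kernel signatures from the earlier sketches and the CUDA runtime API; the repo's code may differ in detail:

```cuda
// Kernel B opts in to 96 KB of dynamic shared memory per block. On Volta,
// requesting more than 48 KB requires setting this function attribute first.
int shmemBytes = 96 * 1024;

cudaFuncSetAttribute(kernel_b,
                     cudaFuncAttributeMaxDynamicSharedMemorySize,
                     shmemBytes);

kernel_a<<<numBlocks, threadsPerBlock>>>(a, m);               // no shared memory
kernel_b<<<numBlocks, threadsPerBlock, shmemBytes>>>(a, m);   // 96 KB per block
```

With 96 KB per block, only one block fits in an SM's shared memory, which is what caps the occupancy as described above.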
Okay, so with that overview of the code, let's go ahead and compile it and run it. If you have the CUDA module loaded, which is just the CUDA 11.0.167 module, you can compile it with your standard nvcc command — if you've used CUDA before, this is how it looks. The .cu extension is just convention for CUDA; it doesn't have to be that way — there's a flag you can use if you want to just name it .cpp, for example.

Now let's run it under ncu. You'll see this makes it take quite a bit longer, because we're running each kernel 19 times in order to collect the requisite statistics for the information that we requested. The output then lists each kernel one by one and gives you some summary output: it gives you this GPU Speed of Light section, which summarizes how effectively you were using the GPU, and the same is true for kernel B and then kernel C.
Now, that only collects a relatively limited set of information. You can use the --set full option to collect pretty much the full set of statistics that Nsight Compute is capable of collecting, and now each kernel will have to be run 75 times, because we're collecting more data, which requires more hardware and software counters. You can see this makes the application take even longer to run, and it quickly becomes a chore if you have a real science application, so it's important to profile only the kernels that you're interested in.

Then, finally, what we can do is store the output to a file. I'm going to do ncu -o tutorial, and what that does is store the output to a particular file which will have the file name tutorial.ncu-rep — the file extension gets added automatically.
I just have to give it the name of the file. This will take about the same amount of time to do, and what you'll see at the end is that I don't get any summary output to standard out — I just get the fact that the report file was created — and if I inspect my local directory now, you'll see I have this tutorial.ncu-rep file. I can then copy that report file down to my laptop and open it in the UI there.

That is the workflow I'm going to use today, because I would prefer not to try to drive the GUI remotely from the system. As I think Charlene pointed out, it's possible to do that using either X forwarding or NoMachine if you have to, but I strongly encourage you to download the user interface onto your local laptop and try it that way — though if you didn't get a chance to do that ahead of time, you may not want to do it right now.
So again, I'm not going to do that today — in fact, I don't think I even have the right X forwarding set up. Well, I guess I do have the right X forwarding setup, but it looks like it crashed, and I haven't bothered to try to debug that. But if you can get that working, that's one way to run the user interface.

Instead, I'm just going to launch the user interface on my local system, and it looks like this. When you open it, you get this pop-up box which, if you had any recent files open, would show them. I'm just going to X out of this box, and I'm going to manually locate the file I downloaded onto my system: I'm going to go to File and then Open File — I'm not going to use Open Project.
The way Nsight Compute works is that it shows every invocation of a kernel as a separate launch in this launch page, so you can see I've launched kernel A, kernel B, and kernel C exactly once. We start with kernel A, because it happens to be the first one we launched in the application, and that's what we're going to look at first.

The first thing you see is the GPU Speed of Light section, which tells you what percentage of both peak compute and peak memory bandwidth we achieved for this particular kernel. It goes from 0 to 100, and if your bar is at 100, that means you are using 100% of your compute subsystem and you are bottlenecked by it — you can't get better than 100; that's just the limit of the machine. A similar thing is true for memory.
If you were at 100% of memory bandwidth, that would tell you you're limited by the pure hardware memory bandwidth of the system. So if I hover over this, you can see I get an SM value of 99.81%. SM stands for streaming multiprocessor — that's just the name for the fundamental compute units on the GPU — so this means I'm using the compute units of the GPU at basically 100%; I could not do any better than this from the perspective of compute utilization.

99.81 — okay. If I scroll down a little bit, you can see that I have the roofline chart; that's the next thing here. The roofline chart tells you what the arithmetic intensity is: the actual dot on the graph corresponds to the achieved arithmetic intensity, and if you hover over it you see both the arithmetic intensity and the performance in FLOPs — this is about 3.35 TFLOP/s.
The vertical axis is the performance in FLOP/s, and the horizontal axis is the arithmetic intensity. So if I hover over this, you can see that the arithmetic intensity is 632.52. How does that compare to the number we were looking at before? Well, if we refer back to our kernel, what we see is that we're doing both a load and a store. When we said 1250 before, we were only accounting for one of those two operations; the true arithmetic intensity accounts for the fact that we're loading eight bytes and then storing eight bytes, so really the number is 10,000 over 16, which is about 625. So we're getting approximately the right answer — approximately what you would expect, 10,000 / 16.
Now, this is intended to look exactly like the plots that Sam was showing before. This diagonal line here is the memory-bandwidth-bound part of the system — that's for arithmetic intensities below about 10 — and then the square is located at the machine balance point that Sam was talking about. This is exactly where memory bandwidth and compute are balanced, and it happens to be at an arithmetic intensity of about 7.5 for double precision, and about double that for single precision, which is listed as floating point here. The memory bandwidth part is the same, because memory bandwidth is memory bandwidth — bytes are bytes — but on the compute-bound side there are actually different roofs for double precision, which is this lower roof, and single precision, which is the upper roof. That reflects the fact that the compute performance of NVIDIA GPUs is not the same for single precision and double precision.
There are about twice as many single-precision floating point units on the GPU as there are double-precision units, so the peak double-precision performance is about half that of single precision.

We can see that our kernel is exactly where we'd expect: it's way over here in the compute-bound regime, has the right arithmetic intensity, and is relatively close, at least in logarithmic terms, to the double-precision roof. Now, if we look at the actual value here, we can see it's about 3.35 TFLOP/s, and if I hover over the double-precision roofline, we can see that the peak is listed as 6.7 TFLOP/s — so we got exactly half of the peak.

Well, the reason relates to what Sam was talking about earlier: when we count FLOPs, how we count depends on the operations that are occurring. This kernel does a double-precision add, which is one FLOP in the way we typically count FLOPs. The GPU can do a single double-precision add in a single clock cycle, but it can also do a single double-precision FMA (fused multiply-add) in a single clock cycle, and the number of FLOPs associated with that is different — there's a factor of two difference. That is a relevant thing to consider when you're counting FLOPs. If we were doing an instruction-based roofline like the one Sam mentioned, it would give you a different view: it's basically saying that we're limited by the instruction throughput of the double-precision pipeline, and that would be true whether we were doing double-precision adds or double-precision FMAs.
Now let's switch to kernel B. It's something like a factor of six below what we got before, and if we look at the Speed of Light section above, it's telling us basically the same information: we're getting only about a quarter of our peak performance, compared to the 100% we saw before. So in kernel A we got 100%, and in kernel B we got only about 25% of peak throughput.

I mentioned in my talk that you can use the baseline feature, so I'm going to go ahead and click Add Baseline, which makes kernel A — the high-performing one — the baseline, and then switch to kernel B. You can see the sharp disparity between the new one and the baseline: the current one is shown in blue and the baseline is now colored in green.
We can then plot these two both on the roofline chart and see that the baseline again has much higher performance than the new one, kernel B. The color of the ring here represents which of the two a point is referencing: the outer ring of this data point is green, so it corresponds to the baseline, and the outer ring of this data point is blue, so it corresponds to the current one, the one we're looking at now.

If I scroll down here and look at the occupancy section, what I see is that the theoretical occupancy of this kernel is only about three percent. That's because, as I mentioned before, NVIDIA GPUs can have as many as 32 thread blocks simultaneously resident on an SM, but we've artificially limited the number of thread blocks that can be resident on an SM by requesting 96 kilobytes of shared memory, which means only one block can be simultaneously resident on each SM. Whereas if we look at kernel A, the theoretical occupancy is 100%, because we are not using any shared memory for that kernel, so there is no limit from — no contention for — memory resources in that kernel. So essentially our theoretical occupancy went down by a factor of 32, which is a large factor in why our performance went down by this factor of four or whatever it was.
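You can also check this effect programmatically with the occupancy API — a minimal sketch, assuming the kernel_b and 64-threads-per-block configuration from the earlier sketches, and that the shared-memory attribute has already been set:

```cuda
// How many blocks fit per SM once 96 KB of dynamic shared memory is requested?
int blocksPerSM = 0;
int threadsPerBlock = 64;
size_t shmemBytes = 96 * 1024;

cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernel_b,
                                              threadsPerBlock, shmemBytes);

// On V100 this returns 1 (vs. up to 32 blocks/SM with no shared memory),
// which corresponds to the ~3% theoretical occupancy Nsight Compute reports.
```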
The fact that it didn't go down by a full factor of 32 is something worth thinking about. I can give some hints about that, but I'd encourage you to think about it first. I'm going to go ahead and add one more baseline, so that now kernel B is also a baseline, and switch to kernel C. If I look at the roofline, kernel C is way over here in the bandwidth-bound part of the regime; you can see its arithmetic intensity is 0.06.

If you look at the code, this is exactly what we'd expect. How do we count flops here? Well, we're doing a single double-precision add — b plus b — so that's one double-precision floating point operation, and then we are loading b once and storing a once. So we're loading eight bytes and storing eight bytes for a total of 16 bytes moved, and we're doing a single floating point operation.
Note that the compiler will optimize this: it's not actually going to load b twice; it's going to load b once into a register and then add that register to itself to do the floating point add — so we're only loading b at this index once. Note, though, something very interesting: the point is very close to the bandwidth-bound part of the roofline, whereas I said that from the perspective of memory accesses this is one of the worst patterns you can have, because every warp is only accessing one element out of the potential 32 that it could have been loading at one time in a coalesced load. The reason you still get a pretty high effective DRAM bandwidth — and this is just the DRAM roofline here, we're not talking about L1 or L2 cache at this point — is that we're getting a lot of L2 cache utilization from this kernel. If we look at the Memory Workload Analysis, we can see that our L2 cache has a hit rate of 90%. This means the cache is very effectively saving our performance in this application, and the way that works out in practice is that, if you look at this code, if warp zero loads the cache line corresponding to this element, then when warp one comes along later, this element is already going to be in cache — and a similar thing is true for all of these other locations as well: they will all have been loaded into cache already.
Now, one thing that we have done for this application — or rather, I should say, one thing we've done for the NERSC installation — is that we have created hierarchical roofline charts that show L1 and L2 cache. So if I look at the L1 and L2 view, I now have rooflines for both the L2 and L1 caches as well; this is representative of what Sam was showing before. If I look at my DRAM value here, I can see this is the 0.06, and if I look at my L1 achieved value, for example, it's 0.02, but the L2 value is actually pretty much right on top of the DRAM value — you can't even distinguish them in this case. Sorry, it's actually right below the L1 value, I should say, rather than the DRAM. This is a pretty good example of what we were talking about before: for many applications, whether or not the L1, L2, and DRAM points are spread out really affects your interpretation of the performance of the application.
Okay, I'm going to stop talking about this toy example now. One thing I will say is that I pretty much stole these examples from a wonderful talk that I'd encourage you to listen to, which was presented at GTC 2019 last year. It's a talk given by some engineers on our developer tools team, Sanjiv Satoor and Magnus Strengert — in fact, I think I saw Magnus in the participants earlier; I don't know if he's still paying attention — but he gave a wonderful talk on these three kernels, which really helps you understand the way that multi-threaded applications work on GPUs, in particular how warps get partitioned onto SMs. So if you want a really detailed view of how these kernels play out from a performance perspective, I'd encourage you to check that talk out; I've included the link at the bottom of my slides.
So what I'm going to do now is look at this gpp.f90 code. This code is really a single kernel that we're going to look at, and one thing I'll say is that I'm not going to talk at all about the science case this represents. It actually comes from the Berkeley code called BerkeleyGW, which is a materials science application, and it does have a real science case behind it. But I'm just going to show you some code, and we're going to think about how the code works without really understanding the science behind it. The README for this repo does give links to some talks that Charlene and Sam and others have given in the past, which give more detail about this application and what the motivation for it is.

The single kernel we're going to look at is this code right here. It's a single OpenACC loop, and it's triply nested — actually there are four nested loops in there: these three loops and then a final loop here. So this is just four do loops in Fortran, and the only work this application really does is this single triply-nested loop, with this little bit in the middle that has a length of three. I've given the trip count of each loop in a comment here: this inner loop just has a trip count of three, but the trip counts of the outer loops are like a thousand each, and then ten thousand for this loop here. That's really all this code does.
If I scroll down to the end, the only thing there is a check of whether the output is correct, and a lot of the boilerplate code that initializes the data we're going to look at is all off in the companion .f90 file — if you really want to understand the data structures, that's where you would go. It's a mix of one-dimensional and two-dimensional arrays, and we'll talk about that in a little bit, but this initialize-data routine basically does the work of loading in the .dat file, which is the actual input data, and then allocating and storing the data in all the arrays. I don't want to focus on that today; I just want to, again, be a computer scientist: look at some code and understand the performance implications of that code.
What this code does is it has these three outer loops, which are loops over some elements — and again, I'm not even going to talk about what they mean physically; I'm just going to treat them as code. We have a loop over the bands, a loop over ngpown, and a loop over ncouls; the loop indices are n1_loc, igp, and ig, and then iw is the inner loop, which just has a trip count of three. It has some conditional code here and here, and then it stores its result as a sum reduction into these values, the ssx_array and sch_array. Now, in the original code this came from, these were intended to be actual arrays of length three, but OpenACC does not support array reductions, at least in the 2.7 standard we're going to work with today. So for this code we have explicitly broken the reduction up into three components manually — you have the _1 component, _2, and _3 — so we're doing a sum reduction over six variables, which are just the three components of each of the two arrays.
If you've never seen OpenACC before, that's totally okay — this works pretty much like you might expect if you've used, for example, OpenMP: the sum reduction means the same thing, the present clause just means that this data is already on the GPU, and loop, gang, and vector tell you something about how the parallelism is mapped to the GPU, which I won't focus on today. We're basically going to treat this as a fully collapsed loop, and that's mostly what we're going to focus on today: we flatten these three loops into a single loop of length, you know, a thousand times a thousand times ten thousand. So we have a fairly large amount of work to do, which is good, because GPUs are only effective if you have a relatively large number of degrees of freedom to work with — it typically needs to be like a million or more, and this satisfies that requirement.
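To make that mapping concrete, here is a very rough CUDA-style analogue of the fully collapsed baseline — the real code is an OpenACC Fortran loop nest and the compiler does this flattening for you, so the names and the decomposition below are only assumptions for illustration:

```cuda
// Rough analogue of collapse(3): one thread per (n1_loc, igp, ig) triple.
__global__ void gpp_collapse3_analogue(/* array arguments omitted */
                                       int nbands, int ngpown, int ncouls)
{
    long long idx   = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    long long total = (long long)nbands * ngpown * ncouls;
    if (idx >= total) return;

    // Recover the loop indices from the flattened index (ig varies fastest):
    int ig     = (int)(idx % ncouls);
    int igp    = (int)((idx / ncouls) % ngpown);
    int n1_loc = (int)(idx / ((long long)ncouls * ngpown));

    // ... the iw = 1..3 body and the six scalar sum-reductions go here ...
}
```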
One thing to note is that these kernels use double-complex, double-precision arithmetic. I'm not going to talk too much about that, but just be aware that when you see something like conjg, it's referring to the complex conjugate, which in Fortran is an intrinsic you can work with.

Okay, let me just take a brief pause — any questions on either this code, or anyone getting stuck downloading the code, or anything like that?

Okay, I'm going to keep going then. In the chat somebody asked: is there a C or C++ version of this code? The answer is yes, but we don't have it in the repo right now. I think that's one of the things we want to do in the future, to have a C version of this code — and in fact, in the actual BerkeleyGW code that this comes from, this kernel has now been converted to C++, so it shouldn't be too hard for us to create a C++ version of the code. We just haven't gotten to that yet for this tutorial repo.
Okay, so as long as you have the PGI module loaded, you'll be able to compile this code — the Makefile has the right flags for you — and what you'll get is this gpp.x executable. If I run it, it does two things: it loads the data and then runs the kernel. It reports the time it took to run the kernel — that triply-nested loop we saw — and also prints some diagnostic output that tells you whether you got the results correct or not, which is useful because if you make some change, you want to make sure you didn't get the answer wrong when you did it. Having that validation is important.
Now, to help me collect the data, I created a simple tutorial script, which is in the tutorial directory. If I look at the tutorial directory, I have this profile.sh script, and if I look at it, it's really just doing ncu with --set full and then gpp.x. I have some logistics in there which are specific to the NERSC installation and which help us get our custom hierarchical roofline analysis. Those hierarchical rooflines that I showed you — like the double-precision hierarchical roofline chart — are not shipping as a default part of the tool, but we created them as a custom report section for Nsight Compute in this tutorial repo, and so we're just kind of giving it to you. Maybe later on, in later versions of Nsight Compute, we'll look at installing these as a default set of report sections that you can collect.
So what I'm going to do is run my profile script. I have to run it through srun so it runs on the GPU, and it takes a single argument — just the name of the profile, to keep it simple. I'm going to name my profile "baseline", and that will create baseline.ncu-rep, which is the profile of the baseline version of this code. Now, unfortunately, this is going to take quite a bit of time, because remember we have to run that kernel 75 times in order to collect the statistics — so whereas it only took about 1.8 seconds to run when we were not profiling, it's going to take a minute or two to profile.

A question in the chat was: does Nsight Compute work with libraries that use CUDA-aware MPI? There are a couple of things to break down there. One is that both Nsight Systems and Nsight Compute are not really designed for large-scale parallel profiling, so you use them to profile individual MPI ranks and just create an individual report file for every MPI rank. The second part of the question was about CUDA-aware MPI; that's a subtlety I won't really get into, but generally speaking, yes — there shouldn't be any additional complication from the fact that GPU buffers are being passed around, because that's really orthogonal to Nsight Compute, which is just analyzing your kernels; it will simply ignore the MPI bits.
Now, while I wait for this to finish, I just want to say a couple more words about this kernel. We saw that we were doing a triply-collapsed loop over our three loop nests, which have meaningful work to do, and we've chosen that as the baseline code because that's basically what you would do as your naive first attempt at porting this application. You generally would follow the paradigm of: I want to expose as much parallelism as possible on the GPU. GPUs are hungry for work, and exposing as much parallelism as you can is a pretty good rule of thumb. So, generally speaking, when you port a code from CPUs to GPUs for the first time, it's a pretty good idea to just expose as much parallelism as possible, and we've taken that approach.
It's actually taking longer than I expected — I hope Cori's not hanging on me. So if we take a look at this code, we might think about the question: is there anything we can do differently? One thing to note here is that the trip counts of the outer loops are both about a thousand, and the trip count of this other loop is ten thousand. If we look at the loop body, we see that there's some work to do, but when we look at the profile we may have questions about how it performs. The first question to answer, really, is: is this a memory-bandwidth-bound code or a compute-bound code? I think that if you looked at this code and stared at it long enough, you still would not be able to figure that out — this is a relatively complex bit of code. It takes the absolute value of a complex number, which is not a trivial thing, and then it has these reduction operations, so it's relatively hard to tell just by inspecting the code whether it's bandwidth bound or compute bound. So the first thing we're going to do is look at our profile and try to understand that. Sorry — this is taking much longer than I expected to collect this profile.
So I got this baseline.ncu-rep file here, and I'm going to do the same process of copying it down to my local system: I'm going to scp it from Cori, from the same location in roofline-on-nvidia-gpus, but this one is called baseline.ncu-rep. All right, I'm going to open Nsight Compute, clear all my baselines, close out the old tutorial report file, and open my new file, which is called baseline.ncu-rep.

The roofline tells us that we're right on the cusp between bandwidth bound and compute bound. There's also a second point on this curve — you can see this is a single-precision point. It turns out that the compiler is generating some single-precision instructions even though there aren't any explicitly in the code, but the performance of that is completely irrelevant: almost all of the work is happening in the double-precision part. What we'd like is to move into the compute-bound part of the chart and be confident that, if we did make enough compute optimizations, we could hopefully get up to this roofline. So the goal is to get up to the roofline, and the way we can do that is by giving ourselves some room to breathe — by being in the compute-bound part of the regime, i.e., by increasing the arithmetic intensity.
It's always the dream of an HPC programmer to be in the compute-bound part of the regime, because then you have a chance of using the full, advertised, roughly seven-teraflop sticker performance of the GPU. It's not always easy to get there — in fact, it's often very hard for many HPC codes — but our goal, or the logical first step, might be: let's try to move over to the compute-bound part of the regime.
What I'm going to do is take note of what I was talking about earlier: I can choose, artificially, to only collapse two of the loops rather than three. If I do that, it means one of the three loops will be executed serially by every thread — so now I'm injecting a thousand or ten thousand iterations' worth of work, depending on which of the three loops I choose to run sequentially in each thread. I could do something like this, and what it would do is enforce that these two loops are parallelized among threads, while both of the two inner loops are then run sequentially by each thread, so there's more work per thread. Whether this brings us to the compute-bound regime or not really depends on the balance between memory operations and compute operations inside the kernel, but generally speaking, giving each thread more work gives us a pretty reasonable chance of increasing the arithmetic intensity of that thread.
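In the same rough CUDA-analogue terms as before — again an assumption for illustration, not the repo's OpenACC/Fortran source — the restructured mapping looks like this:

```cuda
// Rough analogue of collapsing only the (igp, ig) loops: one thread per
// (igp, ig) pair, with n1_loc (and the short iw loop) run serially per thread.
__global__ void gpp_collapse2_analogue(/* array arguments omitted */
                                       int nbands, int ngpown, int ncouls)
{
    long long idx = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (idx >= (long long)ngpown * ncouls) return;

    int ig  = (int)(idx % ncouls);   // fastest-varying index stays coalesced
    int igp = (int)(idx / ncouls);

    for (int n1_loc = 0; n1_loc < nbands; ++n1_loc) {  // now sequential per thread
        // ... iw = 1..3 body and the six scalar sum-reductions, as before ...
    }
}
```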
So let's look at which of the three loops to do that to, by looking at the memory access patterns of each of the arrays in this kernel. There are 2D arrays and 1D arrays — most of them are 2D arrays. If you look at these arrays — the wtilde_array, the I_eps_array, here's aqsmtemp, aqsntemp — what you see is that most commonly ig is the first array index, and in Fortran the first index is the fastest-moving one, because Fortran is column major. By the rules of good performance on NVIDIA GPUs, we generally want sequential threads to be accessing sequential locations in that fastest-moving index. So because ig occurs very commonly as the first array index, it's typically going to be best for performance if ig is mapped across sequential threads. Similarly, igp is the first index for at least one of the arrays here, and then n1_loc, which is our third loop index, is always the outermost index of any of the arrays we're accessing. So a pretty good guess for performance is that we generally want to ensure that array accesses indexed by ig and igp are coalesced as much as possible; as for n1_loc,
that's the least important, because n1_loc is always the strided index for almost all of these arrays, meaning that sequential threads could never access sequential locations via n1_loc — it's always the outermost index, so sequential values of n1_loc are never contiguous in memory. So that's what I'm going to choose to do, and it makes sense: when we collapse the loops, igp and ig get flattened out, but the flattening the compiler does is sane — it's what you'd expect — so when we flatten it out, sequential values of ig still map to sequential threads, and for all of the arrays that have ig as the first index you keep coalesced accesses. Hugo points out that n1_loc is the coalesced index for one of these arrays, and that's true — but what we're really looking at is the overall balance of array accesses in this kernel. We see that for most of the arrays — all of the multi-dimensional arrays — n1_loc is the outer index, and we can hope, or guess, or at least experiment with the idea, that this will offset that one case.
So it's worth an experiment. I'm going to go ahead and compile it now with this change, run it, and see if that helps.

What you can see is that this really did not change the performance at all: it was about 1.8 seconds before and it's about 1.8 seconds now. So that's interesting. By the way, if you didn't follow what I did in the code, I have a set of git patches in the tutorial directory; those patches are basically the automated way to apply what I just did. In fact, if you were to do a git checkout and then git apply the step-one patch, you would see that it makes that same change to my kernel — exactly the change I just made, where I made igp and ig the outer, collapsed loops and made the n1_loc loop sequential.
Finally, the next thing we can do is profile this code to understand: even though the code didn't get faster, did it achieve the thing we wanted it to achieve, namely making the arithmetic intensity of this kernel increase? That was the goal to begin with — it wasn't necessarily to get faster, it was just to give us room to breathe so that we could then apply some optimizations. So I'm going to go ahead and profile it now. Again, this will take a minute or two to collect the data; that's just an inevitable consequence of how long it takes to collect this, so we'll just be patient and wait, and I can take any questions people have while we're waiting.
A
A
Right, so the idea, from the question in chat, is that we're trying to do two things with this change. First, we're trying to increase the arithmetic intensity; that's the actual thing we're trying to achieve. We're trying to give each thread more work to do, which is one way of hoping we can get higher arithmetic intensity, because we might have more flops per byte moved.
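For reference, this is the quantity being targeted, stated in standard roofline terms rather than anything specific to this kernel:

```latex
% Arithmetic intensity: floating-point work per byte of data moved to or from memory.
\[
  \mathrm{AI} \;=\; \frac{\text{floating-point operations performed}}{\text{bytes moved}}
  \qquad \text{[FLOP/byte]}
\]
```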
A
Given that we've chosen to do this, we're then trying to choose which of these three loops to do that operation on, and we're choosing n1_loc as the least harmful loop to do it on, because in all the two-dimensional arrays that we looked at, n1_loc was the outermost index, so it does not correspond to sequential locations in memory. We're hoping, as an experiment, that that will offset the other cases, like that occ array, where it is sequential in memory.
A
So there is no silver bullet here; there's nothing we can do that would unilaterally make the performance of the kernel better without any trade-off. There's always a trade-off, but our guess is that n1_loc is the least harmful loop to make the sequential innermost loop, because of those array access patterns.
A
Well, I don't know for sure; I'm just hypothesizing that the reason n1_loc is the one we want to use, and not ig, is that for most of the two-dimensional arrays ig is the fastest-moving index, and the general rule of thumb for NVIDIA GPUs, or really any GPUs, is that sequential threads should be accessing sequential locations in memory, and we're parallelizing these loops over threads.

A
And so we want to ensure that sequential threads, which correspond to sequential indices in these loops, are accessing sequential locations in these arrays. n1_loc is most commonly the one that is not coalesced, that is not contiguous in memory in its accesses to these arrays, so it's probably the one we can run sequentially within a thread rather than parallelize among threads.
A
Laptop, or to my desktop. While I do that, because of how long it takes to profile, I'm actually going to go ahead and apply the next change now, and then I'll let it collect that data, and as it's doing that I will explain what this change was and why we're doing it. And I need to not do that, I need step two. Okay.
A
So while I collect step two, and I'll explain what step two is in a moment, I'm going to go ahead and look at step one. Let's go ahead and do File, Open, and then look for the step one .ncu-rep report. I'm going to first go to my baseline code and click Add Baseline, so we can see what our performance comparison is, and then look at what the new code looks like; the orange is the baseline.
A
So my hypothesis worked out, and in fact I was able to increase the arithmetic intensities by a factor of about three, which is great, because now I can just focus on optimizing these parts of the compute workload and hope that when I do that I move upward and can get closer to that roofline.

A
Now, I kind of presented it to you as if I knew what I was doing, but in reality this is an experimental process. This is something that both the BerkeleyGW folks and I have looked at for quite a lot of time, to figure out which is the right set of steps to do.
A
But we didn't know that a priori, so we kind of had to experiment. You could do this kind of analysis that I was doing just by looking at the code, if you're experienced enough at GPU programming, but sometimes you might just want to experiment and try different things and then see how they affect the arithmetic intensity.
A
And if you had done one of these other experiments that I mentioned, you might have seen that it went a different way, and maybe that would be an indication that it was the wrong direction to go in. Now, if you look at the actual utilization, you see that it in fact decreased a little bit: my SM and memory bandwidth percentages are actually a little lower than the baseline in both cases.
A
But that's okay, because this is not going to be our final step; we're just trying to get the arithmetic intensity increased. So now we can focus on optimizing these parts of the kernel and then go from there. This is just giving us some breathing room for the optimizations that are going to follow.
A
If I can find my... good, okay. So this is the baseline code, and what we did was collapse two loops instead of three and move this n1_loc loop to the middle; that's the one change we made. The next change we're going to make focuses on these inner bits of the loop.
A
What we do is move this iw loop outside the OpenACC region, and the rest looks like what we had before, where you have igp and ig as a two-dimensional collapsed loop and then n1_loc is our innermost loop, which is sequential. The nice thing about this code now is that we can reduce over two values instead of six, ssx and sch, which are just single complex double-precision values, and then this branchiness here just gets replaced by a single set of values, the ssx and sch arrays.

A
And then, after each of these three iterations of the parallel loop, we just add their respective values to the locations in the actual arrays that we want to do the reduction over. So this is kind of a hack reflecting the fact that OpenACC doesn't have array reductions, but in fact we might have wanted to do it anyway, to reduce the branchiness of the code and reduce the amount of code that is related to reductions, so we can just focus on doing as much compute as possible.
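A minimal sketch of the shape of that workaround, with made-up names and placeholder arithmetic rather than the real GPP variables: the short iw loop stays outside the OpenACC region, each pass reduces into plain scalars, and those scalars are then accumulated into the output arrays.

```fortran
! Hypothetical sketch of the step-2 restructuring; every name here is a placeholder.
! OpenACC has no array reductions, so each iw pass reduces into scalar accumulators.
subroutine reduce_per_iw(term1, term2, out1, out2, ng, ngp, n1_loc, nw)
  implicit none
  integer, intent(in)       :: ng, ngp, n1_loc, nw
  complex(8), intent(in)    :: term1(ng, ngp, n1_loc), term2(ng, ngp, n1_loc)
  complex(8), intent(inout) :: out1(nw), out2(nw)
  complex(8) :: ssx, sch
  integer :: iw, ig, igp, n1

  do iw = 1, nw                          ! small outer loop, kept outside the ACC region
     ssx = (0.0d0, 0.0d0)
     sch = (0.0d0, 0.0d0)
     !$acc parallel loop collapse(2) reduction(+:ssx, sch)
     do igp = 1, ngp
        do ig = 1, ng
           !$acc loop seq
           do n1 = 1, n1_loc
              ssx = ssx + term1(ig, igp, n1)   ! placeholder for the real per-element work
              sch = sch + term2(ig, igp, n1)
           end do
        end do
     end do
     ! The write-back after the parallel loop replaces what would have been an array reduction.
     out1(iw) = out1(iw) + ssx
     out2(iw) = out2(iw) + sch
  end do
end subroutine reduce_per_iw
```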
A
So now what you see is a couple of things. One is that if you look at the point, it's actually a little bit higher, so we've achieved somewhat higher performance. It's hard to see on this scale, but it is higher vertically. If you hover over the point, you can see that this one is 2.5 teraflops, whereas the other was 2.0 teraflops, so that was roughly a 25 percent increase in performance. That's definitely non-trivial.
A
Another thing to notice is that our arithmetic intensity actually decreased again: we had an arithmetic intensity of 20 before, and now we have an arithmetic intensity of 10. So this is interesting. Our goal was to just move this point vertically upward, and we did move it vertically upward, but we also moved it to the left, and I think that this is really an inevitable consequence of doing roofline analysis in real applications.
A
It is very hard to just move the point vertically upward, because real code doesn't work that way. Real code does not bend to our wishes and simply follow a neat set of trends; GPUs are complicated, compilers are complicated, and so it's definitely possible to move the performance upward, but not necessarily without changing the arithmetic intensity.
A
So what we've seen is that we increased the performance and we are still in the compute-bound part of the regime, and this is one reason why it was important to give ourselves that breathing room. The fact that we moved way over to the right, into the compute-bound part of the regime, meant that we had room to make a change which in some sense decreases the number of flops occurring in the loop, because we've removed some of the work.
A
We made a streamlined, simpler kernel, but we also made it a more efficient kernel. And so if we then go and look at our utilization, we now see a story where we have a much higher SM compute utilization than the baseline code.

A
And this is pretty nice, because what it's telling us is that even though we have a little bit less work to do, we're getting more efficient use of the compute units on the GPU, and that correlates with the fact that our total performance went up: we decreased from about 1.8 seconds to 1.4 seconds.
A
Now, if you look at the time, notice that it's actually a little bit different, because we're launching three kernels now; this corresponds to the fact that we are now launching this kernel multiple times, so the time for an individual kernel is different, but the overall runtime per kernel is more than a third lower, and so we've more than compensated for that.
A
And yes, definitely one of the things we want to do in a future version of Nsight Compute is make it easier to either make this a linear axis, or zoom in, or something like that, because it is in fact hard to see that difference. So that is a noted pain point, and we definitely hope to improve it in future versions of Nsight Compute. A useful add-on point to that is that we definitely want your feedback on this; this is a new feature in Nsight Compute 2020.1 with CUDA 11.

A
This is not the final version of the tool; we're definitely going to improve it based on user feedback. We've already gotten some great feedback from NERSC before, and we hope to get more feedback from the users on this call. So that's definitely one of the things we want to get out of today: for you to go ahead and try this out on your own code and give us feedback.
A
I'm not going to go through them in detail; I'm going to leave them up to you if you want to do them later on, and then I'll close with some final, some parting thoughts. So if we look at our code now, from the end of step two, there are a couple of different things we can do here to improve the performance of this code.
A
One of them is that the double-precision divides are challenging because, as Sam mentioned earlier in his talk, a division operation does not map to a single hardware instruction. Division operations are actually a sequence of instructions which implement some algorithm to do a floating-point division, and so in double precision, or in single precision but especially double precision, a division on NVIDIA GPUs is not necessarily an efficient operation; it does not map to a single hardware instruction.
A
However, there is a hardware instruction for computing the reciprocal of a double-precision number, and so in many codes that use floating-point math it is beneficial to compute the reciprocal of a number first, so we can compute some temporary variable like this. That's one thing you could do.
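For a plain real number, the kind of temporary being described might look like this minimal sketch (illustrative names, not the tutorial code); note that it can change the result at the level of floating-point round-off, which is exactly the caveat raised below.

```fortran
! Hypothetical sketch: pay for one reciprocal instead of several divides.
subroutine use_reciprocal(num1, num2, den, out1, out2)
  implicit none
  real(8), intent(in)  :: num1, num2, den
  real(8), intent(out) :: out1, out2
  real(8) :: rden

  rden = 1.0d0 / den        ! the one expensive divide/reciprocal
  out1 = num1 * rden        ! cheap multiplies replace the remaining divides
  out2 = num2 * rden
end subroutine use_reciprocal
```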
A
That's what you would be doing for a simple real floating-point number; it's a little bit different for complex numbers, but it follows the same principle: we compute the reciprocal of a number first and then multiply by it. Now, somebody asked a very logical question in the chat: why doesn't the compiler do this optimization for you?
A
Well, the answer is that sometimes it can, but one reason it may not is that this will change the result, at least at the level of the round-off or truncation error of your floating-point precision, and compilers don't always make optimizations which may change the answers at that precision. It will definitely depend on the optimization level of your compiler.
A
So that is one thing to consider when you're writing code: in many compilers, on many architectures, doing the reciprocal and then multiplying by the reciprocal is a faster operation than doing a floating-point division. This is not specific to NVIDIA GPUs.
A
There are many architectures where that's true. The other thing you could do as an optimization to this kernel is to look at some of these complex math operations and find ways to do them that are less compute-intensive. For example, the absolute value of a complex number is not just taking the sign bit and making the number positive;

A
it's actually a more involved operation to get the absolute magnitude, and so you could change the amount of work and make it more efficient by only looking at the squared magnitude of this I_eps array value and comparing that instead, taking away this abs. If we take away the abs here and the abs here, we can still do the same comparison if we want to, but with less work.
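A minimal sketch of that kind of rewrite, with made-up names rather than the actual comparison in the kernel: when two magnitudes are only being compared, comparing the squared magnitudes gives the same answer and skips the square root hidden inside abs() for complex numbers.

```fortran
! Hypothetical sketch: compare squared magnitudes instead of calling abs() on complex values.
logical function larger_magnitude(z1, z2)
  implicit none
  complex(8), intent(in) :: z1, z2
  ! abs(z1) > abs(z2)  is equivalent to  |z1|**2 > |z2|**2, since both sides are non-negative.
  larger_magnitude = (real(z1)**2 + aimag(z1)**2) > (real(z2)**2 + aimag(z2)**2)
end function larger_magnitude
```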
A
And so if you look in the tutorial readme, step three and step four are the ones I'm talking about, and they basically describe what I want you to do for these complex math and division operations. Those are some things you could look at, and I have provided for you a step three and a step four patch, which actually describe what I'm doing, in case you get lost on that operation.
A
Okay, I'm just about running out of time. The last thing I want to say before I close is that you can customize Nsight Compute to do your own roofline analysis. If you look at our ncu-sections directory, we have actually created for you some custom section files that do hierarchical roofline analysis, and so, for example, if we look at the hierarchical double-precision roofline chart section, this is an actual section file that we created; it's just a simple text file in JSON format.
A
I won't go through in detail, because of time constraints, how the format of this file works, but if you were to read through this text file you could get a sense of what it's doing, and then change which metrics you're collecting and create your own roofline analysis. When we were showing those double-precision roofline charts before, those are actually things that we created on our own and then just added to our installation.
A
With Nsight Compute you can add your own section files, and so one thing you can try is creating your own custom section file, and if you find that it's really useful you can send it to us as feedback and we can consider adding it to a new version of the tool in the future, or you can get, for example, your local HPC center to install it in their installation of Nsight Compute. In fact, if you look at the installation of Nsight Compute at NERSC (I'm not in the right window), and you look at the CUDA 11 installation of Nsight Compute, which is here, and you look at the sections directory,
A
you can see that NERSC has in fact installed there the section files that we created for this tutorial. So you can take advantage of those directly in this installation if you want to; otherwise you could just download them on your own, copy the section file in there, and then use it just the way that I've shown in my profile script.
A
Okay, so that was my roughly one-and-a-half-hour introduction to Nsight Compute roofline analysis on both the tutorial example and the GPP example. Later today you can either apply Nsight Compute to your own code, or you can do those steps three and four that I've shown off in the GPP exercise, if you want to dive deeper. Any questions before we break for lunch?
B

A
Well, nvcc is not the compiler we're using for this. I think it is true that nvcc can do that optimization; I don't remember off the top of my head how it works for double precision in particular. We made double-precision divides much more efficient in CUDA 11 compared to CUDA 10, which is what we're using now, so this operation may in fact be a lot less necessary in CUDA 11. I haven't checked that yet.
B
So
it's
a
great
tutorial
thanks
max,
I
guess
we'll
break
for
lunch
and
be
back
at
a
quarter.
Quarter
past
quarter
past
one
so
feel
free
to
post
your
questions,
or
you
know
issues
on
slack
or
on
the
google
doc
I'll,
be
monitoring
all
those
places
and
I'll
see
you
guys
in
about
an
hour.