From YouTube: Roofline Hackathon 2020, parts 1 and 2
Description
What is the Roofline performance model, and what is the mechanism behind its data collection on NVIDIA GPUs?
Right, so this will provide a brief introduction to the roofline model. It's not designed to be an exhaustive survey of all the roofline research that's happened over the last twelve or thirteen years, but we'll give you a basic introduction to what the model is and, in general, how one might apply it in the abstract. So, after a few acknowledgments, let's start out with a motivating question. Let's say you just spent the last six months porting your application to GPUs. The question becomes: are you done? Was it worth it?
Did you actually make good use of your resources? To answer that, you need to get at the question of what good performance is. That is, if you're getting good performance on the GPU, you're probably done and you should move on to other activities in order to further your research. So let's imagine that you took your application, profiled the mix of loop nests within it running on the GPU, and, for some arbitrary ordering of loop nests...
...you get these seemingly random, different flop rates for different loops. Some of them perform at very high flop rates; some of them get very, very low flop rates. That means the flop rate alone is not particularly insightful. It didn't really tell you whether you are getting good performance, because some of them are fast and some are slow. Second, you could think about just taking your existing code, running it on a Xeon or an AMD EPYC, and seeing what kind of baseline performance you get.
You could then use that baseline performance, compare it to the GPU performance, and conclude whether you're getting good performance. The problem is that that really just tells you relative speedup. That is, some kernels got enormous speedups, dramatically, orders of magnitude faster on the GPU than on the CPU, but there are other kernels, like this one, for which the speedup may have been actually very, very modest, only a slight increase in performance.
The second aspect of good performance is that you want to be making good use of the GPU's compute and/or bandwidth capabilities. That is, a GPU has tens of teraflops of compute performance and ballpark a terabyte per second of bandwidth. You want to make sure that you're using one or both of those to their fullest extent.
Ultimately, what we really need is a quantitative model rather than these qualitative or relative statements. That is, we don't want to just say "okay, it's kind of good"; we want to be able to say we're getting 90% of our theoretical limit. And we don't want to just say we're twice as fast as a Xeon; we want to be saying we're attaining 80% of our GPU's compute capability or its bandwidth capability.
Ultimately, the way the model is constructed, it's basically independent of the instruction set architecture: it doesn't matter whether it's a RISC or CISC architecture, and it doesn't matter if it's x86 or POWER. It's also independent of the implementation of the underlying architecture. This means it's applicable to a CPU or a GPU, or even a TPU.
So let's imagine, just to begin with, that we are running on some kind of superscalar Xeon; in this particular case we have a Skylake Xeon. If you just look at the single-core architectural diagram, it's incredibly complex: the number of stages, the number of operations implied in the complexity of this.
This architecture makes it very hard for any individual to contemplate how it will respond to different code. So one option would be to build a simulator of this Skylake-like CPU and then run our code through that simulator to try to make a prediction of what the performance would be. But that doesn't really give us insight as to what the bottlenecks in performance are; it only really tells us how performance responds...
...to slight changes in architecture, so it doesn't really give us that high-level intuition. Worse, simulation is incredibly slow; it's going to be orders of magnitude slower than simply running the code itself. What we really want are performance models that are orders of magnitude faster than just running the code by itself. So what we want to do is take this incredibly complex view of a processor architecture and simplify it down into a very, very simple model of what these cores look like.
So we might take this kind of high-level view and make the assumption that the individual cores in this machine can attain peak flops if they operate on local data; that is, if the data is local in cache, you always get peak flops. You might assume that all the cores are load balanced, running a single-program, multiple-data code. That means you don't have any kind of Amdahl effects and you don't have any kind of load-imbalance effects.
All the cores are doing the same thing, and thus you can collapse them down into one aggregate compute capability. You might make assumptions like there being sufficient cache bandwidth and cache capacity that cache capacity misses aren't really affecting performance. The only real effects on performance you have are how fast you can do compute and how fast you can move data on and off chip. This kind of high-level model is really the basis for what we would call the DRAM roofline model.
So in this case, all we're really thinking about is compute and data movement to and from DRAM. In that vein, the model is basically premised on answering the question: which is going to take longer? Does it take longer to move data on and off chip, or does it take longer to do the computation once the data is actually on chip? So we can write a simple equation that says: here is what we expect for the runtime.
It should be the maximum, assuming perfect overlap, of the number of floating-point operations in our loop nest divided by the peak flop rate of our machine, and the number of bytes that have to be moved on and off the chip divided by the peak bandwidth of the machine. As I mentioned, this assumes perfect overlap of these two; if you don't have perfect overlap, then you have to sum the two terms. But for the basis of the roofline model in this example, we will always assume perfect overlap of communication and computation.
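Written out, the runtime bound just described is:

$$ T \;=\; \max\!\left(\frac{\#\,\mathrm{FLOPs}}{\mathrm{Peak\ GFLOP/s}},\; \frac{\#\,\mathrm{Bytes}}{\mathrm{Peak\ GB/s}}\right) $$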
Now, to transform this into what's nominally a roofline model, we need to think about rates. If we take the original equation and just divide both sides by the number of floating-point operations in our code, we can transform the equation, and if we reciprocate it one more time, we actually get a slightly different equation. That is, the flop rate is going to be the minimum of either the peak gigaflops of the machine or the product of what we call arithmetic intensity and peak bandwidth. Now, this equation is the core equation for roofline.
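In equation form, that core equation is:

$$ \mathrm{GFLOP/s} \;=\; \min\!\left(\mathrm{Peak\ GFLOP/s},\;\; \mathrm{AI} \times \mathrm{Peak\ GB/s}\right), \qquad \mathrm{AI} = \frac{\#\,\mathrm{FLOPs}}{\#\,\mathrm{Bytes}} $$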
It's the most basic equation in roofline, and it is also the most universal. But buried in this equation is an incredibly important term that I alluded to: this arithmetic intensity. It is the ratio of the number of floating-point operations to the number of bytes in your loop nest. So, in essence, arithmetic intensity is a measure of data locality...
...of how much data reuse each of your loop nests actually has. That is, it is the ratio of the total number of floating-point operations performed in that loop nest divided by the total number of bytes moved on and off the chip for that loop nest. For the DRAM roofline model, this is total bytes to and from DRAM, and that means it includes all of the cache effects, all the prefetcher effects, any kind of speculation effects: any data that moves on and off chip has to be included in the denominator of arithmetic intensity.
That means it can be very different from the total number of loads and stores in your loop nest. A load or a store is just a request to the memory subsystem to bring data into a register, but the cache hierarchy is right there to filter all of those loads and stores and distill them down to only the compulsory set that actually has to go to DRAM.
One other way of viewing this, rather than as the total number of flops divided by the total number of bytes, is as the ratio of sustained flop rate to sustained bandwidth. In that case, time will cancel out in both terms, so you can view it in either form. For most cases, we will view it as the ratio of the number of floating-point operations divided by the number of bytes, and construct performance-instrumentation technologies geared to measure those two terms.
So let's think about how we go about visualizing this. If we take this basic equation, the minimum of the peak flop rate and the product of AI and peak bandwidth, we can plot it as a roofline bound using arithmetic intensity as the x-axis. For a number of historical reasons, we always plot it on a log-log scale. For one, this makes it incredibly easy to doodle on a whiteboard, to brainstorm, to think about how you might have orders of magnitude different bandwidths, or, alternately...
...how Moore's Law has allowed you to have orders of magnitude faster CPUs and GPUs over the years. That is, those orders of magnitude become linear steps, and thus data does not get squashed into the origin; you see it well separated. So in this case, the vertical axis is the attainable flop rate for our loop nest and the x-axis is the arithmetic intensity. One of the terms in this equation is the peak flop rate of the machine; the other term is the product of arithmetic intensity and peak bandwidth.
The model itself imposes a minimum function on these two, and thus we end up being constrained to say that performance must be on or below this line. Now, there's one very important facet in this figure: the transition point where you actually move from being limited by memory bandwidth to being limited by the peak flop rate of the machine. That transition point is the machine balance. That's also an incredibly important term in the roofline model: the machine balance is the ratio of the peak flop rate of the machine divided by the peak bandwidth.
So, in essence, this provides a dual form to what we see with arithmetic intensity: whereas arithmetic intensity characterizes applications, machine balance characterizes architecture. In both cases they are a ratio of flops to bytes. For applications, it's total flops performed divided by total bytes moved; for machine balance, it's the peak flop rate of your machine divided by the peak bandwidth of your machine.
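Side by side, the dual pair is:

$$ \mathrm{AI} = \frac{\text{total FLOPs performed}}{\text{total bytes moved}} \ \ \text{(application)}, \qquad B_m = \frac{\mathrm{Peak\ GFLOP/s}}{\mathrm{Peak\ GB/s}} \ \ \text{(machine)} $$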
So, in essence, the roofline model will tessellate this two-dimensional space of flops and AI into five separate regions, and those five regions are important to think about. First of all, we have the region above the dotted pink line. This is unattainable performance: it is faster than the speed of light. You can never actually have an application run in this regime, because it would mean you are executing at compute rates faster than what the machine is capable of doing.
So we can just say we can never have a dot up in this region. Second, we have the region where we are less than the machine balance, less than this vertical line, but also greater than the machine's bandwidth line. That's also unattainable performance. Basically, it says that you don't have the bandwidth to actually operate your GPU in this regime; that is, for the amount of data locality you have, you have sufficient compute capability, but you don't have sufficient bandwidth. The third regime is the bandwidth-bound regime.
Here we have a low arithmetic intensity, that is, our arithmetic intensity is less than the machine balance, but we are also operating at less than the memory bandwidth of the machine, maybe within 50 percent of the machine's peak bandwidth. We would describe that as the bandwidth-bound regime, and you're actually getting pretty good performance: you're getting 50 percent of what the roofline tells you you can get. It's not a hundred percent, but you are somewhat constrained by the memory bandwidth of the machine. The fourth regime is the compute-limited regime.
In this case, we are to the right of the machine balance. That is, we have lots of data reuse, very high data locality, but we're not actually getting peak flops; we're getting somewhere between 50 and 100 percent of the peak flop rate of the machine. And then, finally, we have the regime where we are actually below 50 percent of bandwidth and below 50 percent of the compute capability of the machine. We might describe this as poor performance.
So let's consider an example. A typical machine balance today is somewhere between five and ten floating-point operations per byte. Now, remember, that's floating-point operations per byte; if you want to convert that to floating-point operations per double-precision word, you need to multiply by eight. That means you have to do somewhere between 40 and 80 floating-point operations per double-precision word in order to guarantee you are compute limited. That's where the machine transitions from being bandwidth limited to being compute limited.
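As a quick worked conversion of those numbers:

$$ B_m \approx 5\text{--}10\ \tfrac{\mathrm{FLOPs}}{\mathrm{byte}} \times 8\ \tfrac{\mathrm{bytes}}{\mathrm{FP64\ word}} = 40\text{--}80\ \tfrac{\mathrm{FLOPs}}{\mathrm{FP64\ word}} $$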
Fundamentally, that's an artifact of technology and money and the way applications are driven today, and it's very, very unlikely that's going to dramatically improve in the future. That is, in the future, if you improve bandwidth, you're more than likely going to improve compute by even more. So we can mark this transition point at five flops per byte, this machine balance. We can then consider a very, very simple vector-vector operation: in this case, we're going to take two vectors, x and y, scale y by a constant alpha...
...add it to x, and store the result into a third vector, z. If we think about this code, we're going to do two floating-point operations per iteration, that is, an addition and a multiplication on every iteration, and we transfer 24 bytes to and from DRAM. That is, we have to read y (8 bytes), read x (8 bytes), and write z (8 bytes); 8 plus 8 plus 8 is 24.
So if we take the arithmetic intensity of this loop nest, the two flops divided by the 24 bytes, we get 0.083 flops per byte. That means that for vector-vector operations, this kind of BLAS-1-like operation, we are going to be extremely memory limited. We are far, far below the machine balance, which means that fundamentally these operations are memory-bandwidth limited and they will perform at a very, very low flop rate.
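As a minimal sketch of the vector-vector operation just described (a NumPy version; the array length n is an arbitrary placeholder):

```python
import numpy as np

n = 1 << 20
alpha = 2.0
x = np.random.rand(n)
y = np.random.rand(n)

# z[i] = x[i] + alpha * y[i]: one add and one multiply per iteration
z = x + alpha * y

flops = 2 * n               # 1 add + 1 mul per element
bytes_moved = 3 * 8 * n     # read x, read y, write z (8 bytes each)
ai = flops / bytes_moved    # 2 / 24 = 0.083 flops per byte
print(f"AI = {ai:.3f} flops/byte")
```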
Now let's consider a more complicated example that has some degree of reuse. In this case, we took some kind of Laplacian operator and did a second-order discretization of it, producing a 7-point, constant-coefficient stencil. We're going to read from seven points in memory, basically a star-shaped stencil, and write to a new grid in memory.
So if we look at this, we're going to do seven floating-point operations, that is, the six additions and the one multiplication by a constant, and we do eight memory references: we read those seven points and we write this other point. So one might think that your AI is 0.11, that is, you take those seven flops and divide by 64 bytes.
Well, that gives us an AI, but the problem is that it's not the right AI. That arithmetic intensity is really the arithmetic intensity measured at the L1 level. The thing to remember is that for the DRAM roofline, we always want to measure the data moving on and off chip, and the observation here is that a perfect cache hierarchy will filter out all but one read and one write for this loop nest.
That means that in reality only one of these memory references will actually miss in the cache; the other memory references will all hit in the cache and thus not incur DRAM data movement. So if we actually do the calculation, we get a different arithmetic intensity: we get the seven flops divided by...
...sorry, the seven flops divided by sixteen bytes, which gives us 0.44 flops per byte. That's the ideal arithmetic intensity for this kind of seven-point stencil. So if we think back: where was triad, where was our vector-vector operation? Well, that's way down here at 0.083. If we do a seven-point stencil, we're going to have five times the arithmetic intensity, but remember, 0.44 is still far less than the five flops per byte, which means that we're still heavily memory-bandwidth bound.
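A minimal sketch of the stencil and the two intensities just discussed (pure Python/NumPy; the grid size and coefficient are illustrative placeholders):

```python
import numpy as np

n = 32                      # illustrative grid size
u = np.random.rand(n, n, n)
v = np.zeros_like(u)
c = 0.1                     # the constant coefficient

# 7-point constant-coefficient stencil: 7 reads + 1 write per point,
# 6 additions and 1 multiplication
for k in range(1, n - 1):
    for j in range(1, n - 1):
        for i in range(1, n - 1):
            v[k, j, i] = c * (u[k, j, i]
                              + u[k, j, i - 1] + u[k, j, i + 1]
                              + u[k, j - 1, i] + u[k, j + 1, i]
                              + u[k - 1, j, i] + u[k + 1, j, i])

ai_l1    = 7 / (8 * 8)   # all 8 references counted: ~0.11 flops/byte
ai_ideal = 7 / (2 * 8)   # perfect caching, 1 read + 1 write: ~0.44 flops/byte
```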
So let's think back to the original motivating question: what is good performance? We have this random assortment of loop nests from our exercise of porting our application to a GPU, and initially it looks like there's no real rhyme or reason to the actual performance. However, if we were to sort our kernels based on their arithmetic intensity and plot them accordingly on the x-axis, we can then compare the performance of our individual loop nests to the associated roofline model.
However, there are a few kernels that are actually outside that region. We can observe, first of all, that we can have kernels with low performance. This kernel down here has very, very low performance, that is, the y-coordinate of this kernel, its flop rate, is very low, but we can actually say it's making good use of the GPU, because it's getting a very high fraction of the memory bandwidth of the GPU.
So we can then focus our performance-optimization efforts on trying to address these red kernels, and bypass spending a bunch of time trying to optimize these green kernels, because we can only get slight increases in performance for them. So, as a recap for this first section: the roofline model is made of two components. You have the machine model itself, which is all the lines in the roofline plot.
Those are your bandwidth lines, your 50%-of-peak ceilings, and your peak flop rate lines, which by definition define the machine balance, the transition point where you go from being bandwidth limited to compute limited. Those lines are going to be unique to each architecture: if you run on a CPU, you'll get one set of lines; if you run on a Volta, you'll get a different set of lines.
The other aspect of the roofline model is application characteristics; in this case, these are all the dots. The dots are basically defined by the number of flops that an application performs and the number of bytes that it moves. That is, the x-coordinate is the ratio of those two, and the y-coordinate is the ratio of flops and runtime. This means each dot is unique to a loop nest.
So let's think about what the general performance strategy is for using roofline. First of all, if you're far below the peak flop rate of the machine, that is, you're in the compute-limited regime like this red dot but far less than 50% of peak, your real goal is to try to improve the performance of that individual loop nest. That's kind of obvious, right? You want those loop nests to actually run faster. But there is a subtlety when you're actually in the bandwidth-limited regime.
Being bandwidth limited doesn't quite mean that you're completely done. The way you actually improve performance in the bandwidth-limited regime is to increase AI, and to increase AI, the arithmetic intensity, you have to decrease the denominator; that is, you decrease the data movement. You decrease data movement by improving spatial locality, by improving cache blocking, or by choosing alternate data structures or alternate data types that require less data. In any of those veins, you may actually reduce the data movement for that loop nest and thereby raise its arithmetic intensity.
I kind of alluded to this early on, but you end up with the question: how can performance ever be below the roofline? How do you ever end up in that regime where your dot is not just right smack on the roofline performance bound? Well, there are a number of different ways. First of all, we can have dots that are actually misplaced. That is, for the kernel itself, when you were doing your instrumentation activity, you may have calculated the wrong number of floating-point operations.
You may have calculated the wrong number of bytes. This can occur for a number of reasons: you can have broken hardware or software performance counters, you can make wrong assumptions about how many flops are actually being performed by an operation, or the way you calculate how many bytes are being moved may be off.
Second, the lines may be misplaced. That is, you may look at an architectural manual and say: okay, here's the peak bandwidth of the machine and here's the peak flop rate of the machine. The problem is that that may be an overestimate; it may be the ideal assumption of what the roofline should look like on that target architecture. The way you really construct a roofline model is with an empirical approach: you actually have to benchmark the memory bandwidth of the machine and the peak flop rate of the machine.
If you fail to do that, depending on the architecture, you could be off by as much as 20%. Those assumptions also presume that you're perfectly load balanced: if all the cores, or all the SMs, are driving the memory subsystem at the same time, you get one bandwidth; if only one SM is driving the memory subsystem and the other 79 on a Volta are completely idle, you will never get peak bandwidth.
The third way you can be below the roofline is that there could be missing lines; that is, there could be bounds other than flops and bytes. The original equation we wrote was based on an incredibly simplified model, where we distilled that incredibly complicated architecture down into just compute and data movement, but the reality is that we can back off on a few of those assumptions.
So let's think about a few of those cases and how to actually rectify them. First of all, let's think about the model or application instrumentation issues causing us to be below the roofline. As I mentioned, the theoretical performance specifications you may get can be highly optimistic: the DRAM pin bandwidth, that is, the number of bits times the frequency, versus the sustained bandwidth could be quite different.
On modern architectures, whether they be CPUs or GPUs, you may fall into a turbo mode, where you actually run at a higher frequency for a short burst of time, or you may be underclocked because you're thermally limited. In either case, that can affect your overall compute capability. And then there's the more subtle aspect of what happens when you have a really, really complicated loop nest and the compiler just gives up. You may say this should never happen, it should never happen, but the reality is it does.
There are times where the compiler just balks at overly complicated code and generates poor-quality code. So what we really need is an empirical approach to performance data. That is, we want to actually benchmark our target machines so that we characterize what the peak flop rate of the machine is and what the peak bandwidth of the machine is. By the same extension, we want an empirical approach to application characterization.
That is, we want to know how many flops we performed and how many bytes we moved, not on a theoretical basis but on an empirical, observational basis. To answer the first question, several years ago LBL developed what was called the Empirical Roofline Toolkit. This is a way we characterize a CPU- or GPU-accelerated machine: it gives us the peak flop rates of the machine and the bandwidth at each level of the memory hierarchy.
It was written with MPI plus OpenMP and CUDA, which allows us to run on multiple GPUs on a multi-GPU accelerated node architecture. So we could run it on the Cori KNL machine and get one set of data: we get a DRAM roofline, but we can also use the same tool to construct an L2 roofline or an L1 roofline, knowing what the bandwidth of the target machine is. By the same extension, we could actually run it on a SummitDev system; this was a few years ago.
So let's get now to the next question of theoretical versus empirical, and think about how we visualize this. We may have the theoretical model: the quoted flop rate of a GPU or CPU and the quoted bandwidth of a GPU or CPU. If we actually go ahead and run ERT, we are almost invariably going to get a lower bandwidth, and we will almost invariably get a lower flop rate. That means that our dot, even though it hasn't actually moved, is now closer to the nominal roofline limit.
Second, we can think about how we actually go about measuring the number of flops in our code. Our code might have things like divide instructions. Well, most instruction set architectures don't actually incorporate a divide instruction, but map a divide into a sequence of floating-point instructions. That means the total number of instructions you're executing is higher, and thus your empirical flop rate is higher than what you might have calculated by simply looking at your loop nest and counting flops.
Next, we can think about what happens when we include all the cache effects, all the data-movement effects. That is, when we go to actually measure how much data we move, we don't want to just look at our loop nest and calculate flops or bytes; we want to think about how many bytes were actually moved to and from the memory subsystem. In some cases, due to cache effects...
...this may actually be quite high, and we might thus see a decrease in arithmetic intensity, which may actually put us very, very close to the nominal roofline. So, just as a recap: using the Empirical Roofline Toolkit or some other benchmarking technique lowers the model, bringing the roofline lines themselves closer to the application's characteristics. Similarly, measuring the application's actual data movement and actual flops gives us a true sense of how close it is to the real machine capabilities.
The next aspect of why we might be below the roofline is centered around the cache hierarchy. That is, we may have bottlenecks in the cache that are actually more constraining. If we think about our memory hierarchy, in this case on a CPU, we might have registers in the CPU core itself, and then an L1, an L2, an L3, and DRAM, where we have locality at each of these levels. This means that we have an associated bandwidth at each of these levels.
We also have an associated machine balance at each of these levels; that is, for each level we have the peak flops of the machine divided by the peak bandwidth at that particular level. By corollary, we also have an associated data movement for each of our applications, for each of our loop nests, at each of these levels. That is, a given loop nest will have a unique number of L1 bytes, L2 bytes, L3 bytes, and DRAM bytes.
That means that a given loop nest also has a unique arithmetic intensity for each level of the memory hierarchy: for your loop nest you have an L1 intensity, an L2 intensity, an L3 intensity, and a DRAM intensity. So we can think about how we might extend our nominal roofline model. If we think back to our original equation, we might define AI with a subscript based on which level it is, and we could then add an additional term per level, so this basically bounds our attainable performance by the minimum over all levels.
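In equation form, the hierarchical extension just described is:

$$ \mathrm{GFLOP/s} \;=\; \min\!\left(\mathrm{Peak\ GFLOP/s},\;\; \min_{L\,\in\,\{\mathrm{L1},\,\mathrm{L2},\,\mathrm{L3},\,\mathrm{DRAM}\}} \mathrm{AI}_L \times \mathrm{Peak\ BW}_L\right) $$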
How do we go about visualizing this? Well, we could think about having 15 different bounds on 15 different figures, but it's actually much, much easier to plot them on a single figure. So in this particular case, we have what is called a hierarchical roofline model.
We start out with our original roofline, which has the HBM bound and the peak flop bound, but we can also add an additional bound based on the L2 cache bandwidth, associated with the L2 intensity for our given application. So for our application, for our loop nest: remember, these two dots are exactly the same loop nest; it just happens that this one uses the AI for the L2 and this one uses the AI for DRAM.
The thing is that we can never have two different performance numbers for a given loop nest. That means that what we will actually observe for a given loop nest is that the dots will always have the same y-coordinate, but they will have different x-coordinates: an x-coordinate for your L2 intensity and an x-coordinate for your HBM intensity. In this particular case, because we are bound by L2 bandwidth, we will see that the DRAM performance is well below the DRAM bandwidth bound of the associated machine.
We could also imagine a similar case where we have reversed things and we actually have much higher L2 locality. Now, there are a few things to observe when we use the hierarchical roofline model. Look at the x-coordinates of your loop nest's AIs: if the L2 AI is very, very different from the DRAM AI, that says that you actually have very, very high reuse in the L2 cache. So in this particular example, we are moving orders of magnitude more bytes to and from the L2 than we do from DRAM.
That says that we're getting really, really good cache locality in the L2, and only a few bytes actually have to trickle out all the way to DRAM. Conversely, we could imagine running a different loop nest where we actually have no reuse; that is, every time we move a byte to and from the L2, we end up moving a byte to and from DRAM. That basically says that the L2 is doing nothing for us.
It's not doing any kind of bandwidth filtering, it's not doing any latency filtering; all we're doing is streaming data through the L2. So when the AIs are widely separated, we have high reuse; when they are very, very close together, we have no reuse. Having no reuse is not necessarily a good thing, because it says that you're not really making good use of the inherent cache architecture that's in every CPU and GPU. You really want to be in that scenario where those AIs are widely separated.
The third aspect of why we might be below the roofline centers around in-core effects. This is really geared towards the instruction set: are we using fused multiply-add, vectorization, tensor cores? Vectors by themselves have their limits: applications have a finite amount of data-level parallelism, and when you use a vector machine, the register-file energy basically scales with the vector length. There are a number of other constraints that say vectors eventually taper out in terms of their performance. The death of Moore's Law is really reinvigorating...
...some facets of complex instruction set computing. You're not going to get back to the kind of complicated load architectures where you're mixing loads and compute; I think load-store architectures are here to stay. But what you will get are very, very complicated compute instructions. This started out with fused multiply-add instructions, where you have a single instruction that takes its operands, multiplies two of them, and adds the product to a third, storing the result. That can be extended, obviously, into a vector version.
You can then go from that version into what is called quad FMA, which appeared in x86 instruction sets. These are basically matrix-vector multiplications in a single instruction. And then on GPUs you have tensor core instructions, where you might have a multiplication of two small matrices added to a third matrix. In all of these cases, a single instruction, or a limited number of instructions, does a large number of operations. But this means that the instructions are now going to be...
...a mix: the instructions in an application are really a mix of scalar instructions (which could be predicated on a vector machine), vector instructions, and matrix operations, and that means that performance is now going to be a weighted average of all these different types of instructions. A scalar instruction might only do one floating-point operation, a vector instruction might do 32 operations, and a tensor instruction might do 128 floating-point operations. You have to add all of those up to understand whether you're getting good performance.
So if we consider something like a Volta GPU, we have ballpark 100 teraflops of FP16 tensor performance, we have something like only 15 teraflops of FP32 performance, and if we get rid of FMA, we only have something like seven and a half teraflops of FP32 add performance. Any kind of deep learning application will be a mix of tensor operations, FP16 operations, and FP32 operations.
That means that your deep learning performance may be well below the nominal tensor core peak, because it's having to average together instructions that are FP32 adds, FP32 FMAs, and FP16 WMMAs. In essence, the mix of the actual instructions imposes an effective ceiling on performance, and the real question then becomes: how close are you to that effective ceiling?
The fourth aspect is FPU starvation. That is, we have assumed to date that it's just a question of how fast we can feed instructions to the FPU, and that that's going to be our limiting factor, modulo data locality. But the reality is that processors have finite instruction decode and issue bandwidth, which means that the number of floating-point units dictates the floating-point instruction rate required to actually hit that peak performance number.
The ratio of those two is the fraction of floating-point instructions required to actually hit that peak performance number. So let's consider an example. Let's say we have some four-issue superscalar CPU with two floating-point datapaths. That means at least 50% of our instructions have to be floating-point to have any chance of getting peak performance. If we have only 25% floating-point, and, say, 75% integer, our performance can never exceed 50% of peak, and it falls progressively from there.
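One way to write down the bound sketched in this example (my formulation, not one stated verbatim in the talk): with issue width $W$, $D$ floating-point datapaths, and a fraction $f$ of the instruction mix being floating-point,

$$ \frac{\text{attainable FLOP/s}}{\text{peak FLOP/s}} \;\le\; \min\!\left(1,\; \frac{f \cdot W}{D}\right) $$

For $W = 4$, $D = 2$, $f = 0.25$, this gives 50% of peak, matching the example.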
So if we have applications that are dominated by integer instructions, we have to really take this into account, because we are not going to be compute limited for those classes of applications. In the worst case, we might have an architecture that has two floating-point datapaths but is only two-way superscalar. In that particular case, you might need a hundred percent of your instructions to be floating-point to have any chance of getting peak performance, which is basically never going to happen.
If you're in that regime and only 25% of your instructions are floating-point, you're going to get a very, very low fraction of peak performance, even if you are well past the machine balance. So this gives rise to a different version of roofline: how do we think about roofline geared not around floating-point operations but around instructions? This is the instruction roofline model; I have the reference here to the paper that we did last year. So how do we go beyond the flop-centered roofline?
We can think about how we might classify applications. We have the heavy floating-point applications; that's actually kind of rare within DOE. We have applications that are a mix of integer and floating-point operations; that's more common. But then we have these emerging classes of applications from bioinformatics or graph algorithms, where they may be integer-only computations, that is, they have no floating-point operations. If you have no floating-point operations, your arithmetic intensity is zero, and you can never even plot a zero arithmetic intensity on a log-log roofline model.
The other aspect is a different way of dealing with mixed-precision codes: rather than thinking about how you do a weighted average of flops, you think about how many instructions you're executing. I will note that one way Intel Advisor dealt with this is that they went from just doing floating-point operations per second to integer operations, or flops plus integer operations per second, which is useful when you want to understand performance as operations per second, rather than bottlenecks in the machine as instructions per second.
So what we really wanted at that point was an instruction roofline model, not an integer-operation roofline model. The most basic way of doing this, on a SIMD machine, is to consider vector micro-ops instead of flops; vector micro-ops can be easily mapped to any kind of vector unit utilization.
The other advantage is that when we deal with CPUs, most of our performance counters don't give us full flop counts, but they actually give us vector micro-ops, which makes it an easier transition to constructing a roofline model. The thing to keep in mind is that full utilization of your vector unit does not imply full peak performance, because peak performance assumes that you did FMA, you did vector operations, you did tensor operations. Vector unit utilization just says the vector units are busy all the time; they could be busy doing inefficient instructions.
So in this particular case, we might start out with the traditional roofline model, which has bandwidth and flops; we have a nominal arithmetic intensity associated with it and a performance well below that number. We can think about moving to a vector micro-op version: we might have the same bandwidth, but now we have peak vector micro-ops per second rather than peak flops per second. This is basically taking how many operations are in an instruction and dividing by it, which means we have a potentially different machine balance.
We have a potentially different AI associated with the number of instructions, or the number of micro-ops, that we're actually executing. When we look at that version of roofline, we may actually be getting a very, very high fraction of the micro-op roofline, rather than of the nominal flop roofline.
So the question then becomes: how do we take this kind of formulation and apply it to an NVIDIA GPU? Well, we might not have vector micro-ops; we probably have warp instructions instead. But then the question becomes: do we want to do instructions per byte, or something else? This gets into the question of what an instruction is on the GPU. If you do the more thread-centric version, then you hide some of the issue limits; if you do the more warp-centric version, then you hide some of the predication effects.
The solution was basically to scale the number of non-predicated threads by the warp size, i.e. divide by 32, and show it in terms of warp instructions per second. You can, of course, then break these down into subclasses, just integer, FP32, load/store, whatever, to understand bottlenecks in individual functional units rather than bottlenecks at the warp issue rate. Now, naively one might think you ought to use bytes, and that would match the existing roofline quite well when thinking about intensity.
That is, if we did instructions per byte, that's our direct translation from our original flops per byte. But the reality is that GPUs access memory using transactions, and those transactions might be 32 bytes for global or local memory and might be 128 bytes for shared memory. So we ended up deciding to use instructions per transaction as the means of understanding both machine balance and application intensity. This preserves the traditional concepts of the roofline model, but it actually ended up allowing us to think of new ways of understanding memory access.
So this means that we start out with our original flop-centric roofline. If we have integer-heavy codes, we want to transform this: we think of it as giga-instructions per second, with some kind of instruction intensity rather than arithmetic intensity. And then we can modify that to be warp instructions, and think about how we would map this if we actually dealt with transactions instead of bytes. This means that, for the instruction roofline model, we have the peak instruction rate of the machine.
We can then basically plot this roofline for a Volta GPU, and we get these numbers; it's just a different way of trying to analyze application performance. But what it really allowed us to do is think about global memory access differently. Rather than thinking of total instruction intensity, total instructions divided by total transactions, if we think specifically about load/store instructions divided by global transactions, we get a very special meaning in this particular case: this allows us to understand the efficiency of global memory access.
We can actually observe that there are three very important intensities of load/store instructions per global transaction, basically mapping to what our memory access pattern is like. If we're doing fully random access, where every thread in a warp is accessing a random location in memory, we're basically doing the same thing as if we were striding by greater than 128 bytes; that's the minimum intensity we can ever have for load/store intensity.
Conversely, if all the threads in a warp access exactly the same memory location, then only a single transaction is required, and thus we can have a very, very high intensity of one instruction per transaction. Somewhere in between, we have the unit-stride memory access pattern, where our warp just walks through memory sequentially and the threads in a warp access 32 consecutive memory elements.
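As a back-of-the-envelope sketch of those three intensities (assuming 32-byte global transactions and 8-byte words, per the discussion above):

```python
threads_per_warp = 32
txn_bytes = 32      # global memory transaction size
word_bytes = 8      # double-precision element

# Transactions generated by one warp-level load/store instruction:
txn_random    = threads_per_warp                            # every thread its own transaction
txn_stride1   = threads_per_warp * word_bytes // txn_bytes  # 256 B / 32 B = 8 transactions
txn_broadcast = 1                                           # all threads hit one location

for name, txn in [("random", txn_random), ("unit-stride", txn_stride1),
                  ("broadcast", txn_broadcast)]:
    print(f"{name:12s}: {1 / txn:.4f} instructions/transaction")
```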
We have the unattainable low to the left and the unattainable high to the right, and if we actually plot our application's intensity, we may see that our application is actually accessing global memory inefficiently; that is, out of the box, our application may be accessing memory close to a random-access pattern. Through some optimization efforts, we really want to think about how we can transform that intensity to improve it, to get it close to the unit-stride intensity.
We can do the same exercise in shared memory and think about how that same kind of concept, shared load/store instructions divided by shared transactions, allows us to quantify the number of bank conflicts that are actually occurring. That is, if all the threads in a warp access the same bank in shared memory, you're going to get a 32-way bank conflict and you're going to generate 32 transactions. That's really low performance.
Conversely, if all the threads in a warp access a different bank in shared memory, you'll only generate the one transaction and you'll get a high shared load/store intensity. In the same way, we can think about plotting our application's intensity and about how optimization may improve that intensity. So if we look at a Smith-Waterman-type example, we may observe that in the naive implementation we actually get kind of moderate instruction throughput.
If we think about what the actual global load/store efficiency is, we see that it's actually rather poor; that is, we're doing almost random access in the naive implementation, while, conversely, we may have no bank conflicts. The optimized implementation may do memory coalescing: this allows us to move from a strided memory access to a coalesced, unit-stride memory access and thereby get better performance. Once again, the number of bank conflicts didn't really change.
For the instruction roofline: the traditional roofline is really about telling us about performance. The use of FMA or SIMD vectors really has no effect on intensity, but it can increase performance. What the instruction roofline does is tell us about bottlenecks, whether those bottlenecks are in the issue rate or in memory. Any use of FMA or vectors actually decreases instruction intensity and may actually decrease instruction performance, while any kind of integer instructions may actually increase instruction intensity and increase instruction throughput.
One of the other ways you can be below the roofline is if you are underutilizing the parallelism of the machine. If we think about running a traditional thread-scaling experiment on a CPU, we may, for different problem sizes, scale up the number of threads and observe the differences in flop rates. Remember, this is a log-log scale, so we can actually see where the blue problem size saturates in performance while the green problem size actually falls over in performance.
The problem is that this kind of formulation, this way of thinking about thread scalability, doesn't really tell us anything about what went wrong. Why did the green problem size actually see a turnover in performance, and see lower performance as we increased the number of threads? So one of the things Khaled Ibrahim did was to take roofline and use it to understand process or thread scalability.
Basically, you're doing a 2D scatter plot with a trendline function to understand how performance and arithmetic intensity change with thread concurrency. So whereas the blue line in this case may actually see substantial increases in performance through every different concurrency, between 1, 2, 4, 8, 16, and 64 threads, what we actually observe is that it's losing arithmetic intensity; that is, the arithmetic intensity is starting to wane. Conversely, the green and red problem sizes see ideal scaling for a range.
This can also be applied to other NAS benchmarks. It can be used to understand the difference between OpenACC and CUDA, and I will point people to this paper from Bench last year, which actually won a best paper award, for understanding the differences between these different programming models on different NAS parallel benchmarks. So, to provide a recap: what roofline is really doing is bounding performance as a function of arithmetic intensity. Roofline itself has those horizontal lines, which are the compute ceilings, and it has the diagonal lines, which are the bandwidth ceilings.
You have arithmetic intensity, which is going to be unique for each loop nest. It is unique for each level of memory, and it is that measure of data locality, the measure of temporal locality. It is the ratio of the total number of flops that your loop nest performs divided by the total bytes your loop nest actually moves. When we plot on the roofline, every loop has one dot per level of the memory hierarchy.
So if you have ten major loops and four levels of memory hierarchy, you have 40 dots that you might have to plot. More than likely, you'll only plot a subset of those at a time: you might plot only the DRAM dots, or you might plot all four levels for a single loop nest. That cuts down on how much data you're actually having to visualize. When one of those dots is close to the ceiling, that indicates you are likely seeing a performance bound.
The position of those dots relative to each other within a loop nest is indicative of the cache locality. Remember, for a given loop nest, if your four dots for L1, L2, L3, and DRAM are widely separated, that means you're getting great cache locality; if they're all bunched together, that basically means you're streaming data through the cache. All of these concepts apply equally to any kind of GPU or other accelerator.
So what do we use roofline for? Well, there's the obvious thing of using it to understand the differences between architectures, programming models, and implementations. That is, why do some architectures or implementations perform better than others? Why do some compilers perform better than others? But it's also useful for understanding and predicting performance on future machines. That is, it allows us to set realistic performance expectations and focus on where we actually need to drive future architectures; in some cases, we want more bandwidth.
It's also, of course, useful for understanding performance bottlenecks and for trying to motivate software optimization. But, finally, it's really good for determining when we're done optimizing code. When you are close to that roofline limit, you really need to think about how you make algorithmic changes to move forward, because you're really not going to make substantial increases in performance when you're already within 90 percent of the roofline limit.
At the same time, you can imagine taking your performance today, your performance a month from now, and your performance three months from now, and plotting it all on the same roofline figure; you can see a resultant trajectory and see how you're actually approaching the roofline limit. I will say that the model itself is just one piece of the puzzle. It defines the basic concepts and the basic equations, but at the same time you have to have system characterization, which really defines where the lines are...
...and you have the application characterization to define the dots, and then you have some kind of visualization and analysis tool. In the remainder of this tutorial, Charlene will demonstrate how to construct the roofline model on NVIDIA GPUs, focusing really on system characterization and application characterization. Max will demonstrate how to use Nsight Compute to automate the roofline collection; this includes the GPU benchmarking, the application characterization, and integrated visualization. And then you will go ahead and use Nsight Compute to analyze your own individual applications.
Nsight will probably always implement a subset of what has been done. Remember, there are the kind of research activities which are the bleeding edge: they go ahead and think about new ideas, think about new concepts. Some of those pan out, some of them may not pan out; some of them are broadly applicable, some of them are more niche. Nsight itself will most likely take a subset of them and incorporate them. Okay.
Sure, I was going to answer one more question that I think was in the chat window. You can always construct an operation-centered roofline, which takes together both integer operations and floating-point operations, or you can do that in mixed precision. Nominally, though, when we think about instruction roofline versus flop-centered roofline, those tend to be separate concepts: we can think about floating-point instructions per second or total instructions per second, or we can think about floating-point operations per second.
So after the theory talk, I would like to go through the practical mechanism as to how to collect and refine the data, and today we're really just focused on NVIDIA GPUs. Like Sam says, the general methodology of the roofline model works for all architectures; you just need to find the proper metrics and the proper tools to collect the relevant data.
The goal here is to plot a roofline like this. You probably have multiple memory levels on the architecture, different data precisions, and different instruction types you may be executing; you may be using CUDA cores, or you may be using tensor cores as well. But essentially we want to have a very complex roofline like this.
And if those kernels cannot reach, say, the peaks that we see from the white paper, then we really have to consider what the actual runtime environment is; maybe the power is being constrained, or some other things are constraining the peak. But by doing this, we do get a more realistic understanding of what the peaks can be, because if those micro-kernels cannot reach the advertised peaks, then we cannot expect that large-scale HPC applications will do so.
That's the whole purpose of this Empirical Roofline Toolkit; I'll talk a bit more about it in a moment. The first step is to get the ceilings, and then we need to measure the application data to put the dots on the roofline. Those dots have two coordinates. The x-coordinate is the arithmetic intensity, and to calculate that we need to measure the flops, which is the total number of floating-point operations carried out in the kernel, and then the data movement.
So, how many bytes have been moved for a particular memory level? The ratio of these two is the arithmetic intensity. The y-coordinate is the performance, so flops per second, and for that we need to get the runtime for the kernel. So there are basically three quantities we have to measure. And I guess here I have to say that for data movement there's a number of bytes for each different cache level; so, if you're looking at a hierarchical roofline, then you need to measure more than three quantities.
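A minimal sketch of turning those measured quantities into roofline coordinates (the names are illustrative):

```python
def roofline_dot(flops, bytes_moved, time_s):
    """Return (x, y) = (arithmetic intensity, performance) for one kernel."""
    ai = flops / bytes_moved        # flops/byte at the chosen memory level
    gflops = flops / time_s / 1e9   # sustained GFLOP/s
    return ai, gflops

# For a hierarchical roofline, reuse the same flops and runtime with the
# bytes measured at each level (L1, L2, HBM, ...): same y, different x.
```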
But the method of calculating the arithmetic intensity and the performance is the same. So after getting all these numbers, we need to plot them, and the more automatic way of doing that is to use Nsight Compute. We have a few section files you can use to plot the roofline, but the section files can collect these quantities as well, so the whole workflow is automatic.
I'll give you more details about those in a moment. So, for the first step, we can use the theoretical numbers, or we can use the Empirical Roofline Toolkit, which gives you a more realistic set of peaks. This plot here shows how ERT works: it basically sweeps through a range of data sets and measures the bandwidth for each working set, and also the flops. Depending on how compute-intensive the micro-kernel you're using is, you could be getting the peak bandwidth or the peak flops.
So then, how do we collect the application data? The manual way of doing this is to use Nsight Compute to collect these metrics yourself. I have listed the metrics here. The scripts we have in this repository use exactly the same metrics as well, so you can also integrate those into your own workflow, and this should produce exactly the same results as Nsight Compute.
I think previously we also published a set of metrics for CUDA 10; those are slightly different from what we have now, and if you have access to CUDA 11, we would really recommend you try the new metrics. But these metrics should be equivalent to each other.
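For reference, a hedged sketch of the kind of per-kernel post-processing those metrics feed into. The metric names follow the Nsight Compute naming scheme used by roofline tutorials of this kind, and the numeric values are placeholders, not measurements:

```python
# Hypothetical values as they might be read from an ncu CSV export.
metrics = {
    "sm__cycles_elapsed.avg": 1.2e7,                  # kernel duration in SM cycles
    "sm__cycles_elapsed.avg.per_second": 1.3e9,       # SM clock rate
    "sm__sass_thread_inst_executed_op_dadd_pred_on.sum": 1.0e9,  # FP64 adds
    "sm__sass_thread_inst_executed_op_dmul_pred_on.sum": 0.0,    # FP64 muls
    "sm__sass_thread_inst_executed_op_dfma_pred_on.sum": 2.0e9,  # FP64 FMAs
    "dram__bytes.sum": 4.8e9,                         # DRAM data movement
}

time_s = metrics["sm__cycles_elapsed.avg"] / metrics["sm__cycles_elapsed.avg.per_second"]
flops = (metrics["sm__sass_thread_inst_executed_op_dadd_pred_on.sum"]
         + metrics["sm__sass_thread_inst_executed_op_dmul_pred_on.sum"]
         + 2 * metrics["sm__sass_thread_inst_executed_op_dfma_pred_on.sum"])  # FMA = 2 flops
ai = flops / metrics["dram__bytes.sum"]   # DRAM arithmetic intensity
gflops = flops / time_s / 1e9             # sustained performance
```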
A
All
right
so
coming
to
the
last
part,
which
is
you
know
to
actually
plot
and
the
replying
charts
you
can
use
as
I
compute
you,
which
automatically
gets
a
brief
line
charge
like
this
one
thing
I
didn't
noticed
at
the
beginning,
is
that
you
know
the
the
roof
line.
Charge
is
one
per
current
one,
one
child
per
kernel.
So
if
you
don't
see,
you
know
the
relevant
kernel
only
charged,
you
may
want
to
go
to
this
drop
down
a
button
to
see
you
know
what
are
the
kernels?
You
have
profiles.
A
Of course you can just profile the kernel you wanted; it's just something you may not know at first glance. If you use the scripts here, we have kind of put all the dots on the same chart. For example, here you see different colors for different kernels and then different markers for different cache levels. But these scripts are really for the example we have in the repository, which is the GPP kernel.
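A minimal sketch of that style of log-log roofline plot in matplotlib, in the spirit of the repository's plotting scripts; the peak numbers and the sample dot are illustrative placeholders, not measured values:

```python
import numpy as np
import matplotlib.pyplot as plt

peak_gflops = 7000.0   # compute ceiling, GFLOP/s (placeholder)
peak_bw = 900.0        # HBM bandwidth ceiling, GB/s (placeholder)

ai = np.logspace(-2, 2, 200)                  # flops/byte
roof = np.minimum(peak_gflops, ai * peak_bw)  # min(compute, AI * BW)

plt.loglog(ai, roof, label="HBM roofline")
plt.loglog([0.44], [350.0], "o", label="sample kernel (placeholder)")
plt.xlabel("Arithmetic intensity (FLOPs/byte)")
plt.ylabel("Performance (GFLOP/s)")
plt.legend()
plt.savefig("roofline.png")
```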
To quickly show a few examples of the hierarchical roofline charts we have if you use Nsight Compute: this is a very typical GEMM example using tensor cores. We have five different kernels in this code, and using Nsight Compute you can see all the kernels that have been profiled. I believe these are just two different invocations of the same kernel, but you can see that we have three different dots representing the performance for different cache levels: L1, L2, and HBM.
This one is based on the kernel count; this one is based on the flops performed in that kernel. All these plots have all the dots below the single-precision peak, which is about 14 to 15 teraflops, and when you have a different setting for the code, we can have tensor core kernels as well.
So these scripts can really be customized to satisfy all your plotting needs; you just need to do a little work. The scripts we have in the repository are really the most basic ones, and of course you can employ them over the whole optimization path. So, at step one we have the performance here, and at step two, and as we optimize the kernel further, we see performance going up and up, with the arithmetic intensity also changing between different steps.
The slides will be ready; I have posted mine and Sam's, and Max will have some too, so all of them should be available after the event. And then I see three other questions. How to embed your workflows into the scripts: I think we'll get into more details about this, but I can answer quickly.
We have this GPP code, and this is the input file for the GPP code, and we have two job scripts. One is using Nsight Compute; using this you can collect the profiles, which can be opened using the Nsight Compute UI. The other script, the "customized" run script, is using the metrics I mentioned, and it is the more customized way of collecting metrics using the Nsight Compute command line; you won't get a visual profile. And to embed your workload into these scripts...
The SPEC ones? I believe so. And are you trying to profile these kernels, these benchmarks?
Yeah, I will be trying some of them, so I wonder if we have... I think they required some sort of license.
Any other questions? Max, do you see anything on the Slack channel? Is anyone asking me something?
Okay, then I guess we're due for a break. We'll be back in 15 minutes, and Max will be talking about the examples he just mentioned. All right, see you guys in a bit.