From YouTube: Introduction to the Roofline Model
Description
Samuel Williams of LBNL presents a talk on Introduction to the Roofline Model. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/ Due to some data loss, this recording is missing the start of the talk. Session Chair: Yan Zhang.
A
This forms the basis of what we call a DRAM Roofline model, which is kind of the simple version covering compute and DRAM. It basically boils down to answering the question: which takes longer? Does it take longer to move your working set, your vectors, your matrices, from DRAM to the processor or from DRAM to the GPU, or does it take longer to actually compute on them?

We can come up with a very simple equation that says the actual run time for your loop nest is going to be the maximum of either how long it takes to compute, which is the number of floating-point operations you execute divided by the peak flop rate, or the time required to move the data, which is how big your data sets are and how fast you can actually move them: what is your peak memory bandwidth?
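As a sketch, this bound can be written directly; the machine numbers below are hypothetical, not from the talk:

```python
def roofline_runtime(flops, bytes_moved, peak_flops, peak_bw):
    """Lower bound on loop-nest run time: the max of compute time
    (flops / peak flop rate) and data-movement time (bytes / peak BW)."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

# Hypothetical machine: 10 TFLOP/s peak compute, 1 TB/s DRAM bandwidth.
# For this kernel, data movement (0.024 s) dominates compute (0.0002 s).
t = roofline_runtime(flops=2e9, bytes_moved=24e9,
                     peak_flops=10e12, peak_bw=1e12)
```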
A
Of course, unfortunately, arithmetic intensity came about in the early 2000s, around 2007, as I was writing my thesis, and it has nothing to do with artificial intelligence. Now, I apologize if I use the term AI, and it can be confusing, but when I use it in this context it will always mean arithmetic intensity. What it really is is a measure of data locality: the total flops that you perform divided by the total bytes that you move for your loop nest. For the DRAM Roofline, this is the total number of bytes that you move to and from DRAM, and thus it will include any kind of cache effects and any kind of prefetcher effects, and it will almost invariably be different from the total number of loads and stores that you request from the memory subsystem.
A
A
So
how
do
we
visualize
this
rootline
model?
Well
we're
going
to
take
this
equation,
this
flop,
being
the
minimum
of
peak
flops
and
the
product
of
arithmetic,
intensity
and
peak
bandwidth,
and
we
want
to
plot
it
on
a
log
log
scale.
Why
do
we
do
that
for
one?
It
makes
it
extremely
easy
to
doodle
things
on
a
whiteboard
and
two.
It
makes
it
very
easy
to
make
extrapolations
for
moore's
law
where
performance
used
to
double
every
few
years.
A
So how do we do this? Well, the vertical axis is the attainable flop rate for our loop nest, and the horizontal axis is the arithmetic intensity for our loop nest. What we see is that one term in this equation is the peak flop rate of the machine. The other term, on a log-log scale, is this product of arithmetic intensity and, in this case, GPU HBM bandwidth: the roofline. The minimum function says that we must be on or below this curve.
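A minimal sketch of that minimum, with hypothetical peak numbers standing in for a real GPU:

```python
def attainable_gflops(ai, peak_gflops, peak_gbs):
    """Roofline ceiling: the minimum of the peak flop rate and
    arithmetic intensity times peak bandwidth."""
    return min(peak_gflops, ai * peak_gbs)

# Hypothetical GPU: 7000 GFLOP/s peak, 900 GB/s HBM bandwidth.
# Below the machine balance (7000/900 ~ 7.8 flops/byte) the bandwidth
# diagonal applies; above it, the flat peak-flops ceiling applies.
low = attainable_gflops(0.083, 7000.0, 900.0)    # on the diagonal
high = attainable_gflops(50.0, 7000.0, 900.0)    # on the flat ceiling
```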
A
So it's very important to keep this in mind, because each machine will have a different machine balance. If you simply increase the flop rate, it may not actually improve your application performance; if you increase bandwidth, it may not increase your application performance. It all boils down to where your loop nest's arithmetic intensity lies with respect to this inflection point.
A
You have this other region of being less than the machine balance but greater than the product of AI and peak bandwidth. Once again, this is unattainable performance; you could basically think of it as having to move data faster than the speed of light. It's not going to happen.

You also have this region here, where you are less than the machine balance but you're also getting close to the machine bandwidth. In this case I'm showing about 50% of machine bandwidth. In that region you're basically memory bandwidth bound; you have about a factor of two to go to improve performance, but that's it.

Correspondingly, there is a compute-bound region where you're getting at least half of your peak flop rate but you're also greater than the machine balance. And then, finally, there's this other region where you are below both half of peak flops and half of peak bandwidth, which in this case we might deem just poor performance. We want to get ourselves out of this region and into either the blue or pink region, but we can never be above the solid blue or solid pink line.
A
So let's think about an example today. A typical machine balance today on a GPU or CPU might be 5 to 10 flops per byte. That corresponds to about 50 floating-point operations per double-precision word. That's really, fundamentally, an artifact of technology and money, and it's unlikely to improve in the future, so just get used to that.
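As a quick sanity check on that conversion (the balance value below is an illustrative point in the 5-10 range, not a measured number):

```python
# A machine balance in the 5-10 flops/byte range, times 8 bytes per
# double-precision word, gives roughly 40-80 flops per word, i.e. ~50.
balance = 6.25                          # flops per byte (hypothetical)
bytes_per_double = 8
flops_per_word = balance * bytes_per_double
```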
A
So that means we can set our inflection point, our machine balance, at about five. We can consider a trivial vector-vector operation, in this case kind of a DAXPY, where we read two vectors x and y, scale one of them, add them together, and write to a third vector z.
A
This
loop
nest
does
two
floating
point
operations
per
iteration
and
it
transfers
24
bytes
per
iteration.
That
is
read,
read
and
write.
That
gives
us
an
arithmetic
intensity
of
2,
divided
by
24
1
12
of
0.083
flops
per
byte.
Where
does
0.083
lie
on
this
graph?
Well,
it's
very
very
far
to
the
left
of
the
machine
balance.
That
means
that,
no
matter
what
we
do,
we
are
ultimately
memory
bandwidth
bound
on
this
curve
and
we
will
get
a
fairly
low
flop
rate.
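A sketch of the kernel being analyzed; the function name is illustrative, and the byte count assumes double-precision data with no cache reuse:

```python
def scaled_vector_add(a, x, y):
    """z[i] = a * x[i] + y[i]: per iteration, 2 flops (one multiply,
    one add) and 24 bytes of DRAM traffic (read x, read y, write z)."""
    return [a * xi + yi for xi, yi in zip(x, y)]

flops_per_iter = 2
bytes_per_iter = 3 * 8                    # three 8-byte doubles
ai = flops_per_iter / bytes_per_iter      # 1/12 ~ 0.083 flops/byte
```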
A
So,
let's
consider
a
more
interesting
example
where
we
have
some
degree
of
data
reuse.
So
in
this
case
we
have
a
pde,
that's
descritized
into
a
seven
point.
Stencil.
A
We
can
observe
that
we're
going
to
read
seven
points
here
right,
one
point
here,
but
if
we
do
this
simple
calculation
of
saying
seven
flops
divided
by
64
bytes,
we
get
the
wrong
arithmetic
intensity.
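To make the distinction concrete, a sketch contrasting the naive byte count with a cache-filtered count; the ideal-reuse assumption (one read plus one write per point) is mine, not a number from the talk:

```python
bytes_per_double = 8

# Naive count: 7 reads + 1 write = 8 words touched per grid point.
naive_ai = 7 / (8 * bytes_per_double)     # ~ 0.109 flops/byte

# With ideal cache reuse, each input point is fetched from DRAM once
# and shared across neighboring stencils: ~1 read + 1 write per point.
cached_ai = 7 / (2 * bytes_per_double)    # ~ 0.44 flops/byte
```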
A
A
So,
let's
think
back
to
this
question
of
what
is
good
perform,
and
so,
if
we
think
back
to
that
initial
set
of
loop
nests
where
performance
is
completely
random,
what
can
we
do?
Well,
we
can
take
that
completely
random
set
of
loop
nests
and
rather
than
ordering
them
by
just
the
order
in
which
we
happen
to
run
that
loop
nest.
We
can
order
them
based
on
arithmetic
intensity.
A
We
can
then
plot
those
performance
numbers
relative
to
the
mach,
the
roofline
model,
the
the
flop
and
the
bandwidth
capability
of
our
machine
and
highlight
which
of
those
kernels
actually
lie
in
our
bandwidth
bound,
which
lie
in
the
compute
bound
and
which
lie
in
the
poor
performance
regions
of
our
code
of
our
application
machine.
A
This
means
that
we
can
get
some
kernels
that
actually
have
very
low
performance,
but
are
making
good
use
of
the
machine
because
they're
making
high
use
of
memory
bandwidth,
we
have
other
kernels
which
are
getting
high
performance
like
this
red
one,
but
it's
actually
making
poor
use
of
the
machine
because
it's
not
actually
getting
more
than
50
percent
of
stream
or
50
of
peak
bandwidth.
A
Thus,
we
can
focus
our
optimization
efforts
on
those
three
red
dots
to
try
to
improve
their
performance.
Broadly
speaking,
roofline
is
going
to
be
made
of
two
components
and
that's
kind
of
your
high
level.
Take
away
from
this,
that
is,
you
have
the
machine
model
which
defines
the
diagonals
in
the
roof
line.
This
is
your
peak
bandwidth
and
your
peak
flops
and
the
obvious
transition
point
between
the
two
and
then
you
also
have
the
application
characteristics.
A
These
are
the
dots
that
are
defined
by
for
each
loop
nest,
how
many
flops
it
performed
and
how
many
bytes
it
moved
the
basically.
This
boils
down
into
two
activities:
machine
benchmarking,
to
give
us
the
lines
and
application
instrumentation
to
give
us
where
the
dots
actually
lie.
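A sketch of how the two activities combine; the peak numbers and the per-kernel counter values below are hypothetical:

```python
# Machine model, from benchmarking (hypothetical peaks).
peak_gflops, peak_gbs = 7000.0, 900.0
balance = peak_gflops / peak_gbs

# Application characterization: per-kernel (GFLOPs, DRAM GB) from
# instrumentation, e.g. hardware counters (illustrative values).
kernels = {"triad": (2.0, 24.0), "stencil": (7.0, 16.0)}

dots = {}
for name, (gflops, gbytes) in kernels.items():
    ai = gflops / gbytes                       # flops per byte
    ceiling = min(peak_gflops, ai * peak_gbs)  # Roofline bound
    region = "compute-bound" if ai >= balance else "bandwidth-bound"
    dots[name] = (ai, ceiling, region)
```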
A
So
when
we
go
to
actually
optimize
these
codes,
what
do
we
do?
Well,
we
think
about
how
we
first,
how
do
we
get
to
the
roof
line?
That
is,
for
these
red
dots?
We
want
to
take
that
red
dot
and
get
it
as
close
to
the
peak
flop
rate
of
the
machine.
That
is,
we
want
to
move
this
red
dot
vertically,
but
that's
not
the
only
case
we
may
have.
If
we're
down
here
at
this
green
dot,
we
can't
really
move
this
green
dot
vertically
too
easily.
A
What
we
really
need
to
do
is
increase
its
arithmetic
intensity
to
increase
its
arithmetic
intensity.
We
have
to
remove
data
or
move
less
data
by
moving
less
data,
the
ratio
of
flops
to
data
movement
increases,
and
thus
we
can
basically
take
this
green
dot
and
have
it
slide
along
the
diagonal
to
higher
performance.
A
So
this
raises
the
question:
how
can
performance
ever
be
below
the
roofline?
So
there
are
a
number
of
scenarios
that
this
actually
can
occur
in.
We
can
imagine
the
case
where
we
have
insufficient
cash
bandwidth
and
insufficient
data
locality
that
is
rather
than
having
dram
boundary
performance.
The
cash
bounds
are
performance.
A
A
Unless
we
have
completely
idle
sms,
we
can
have
something
else
where
we
have
an
interesting
instruction
mix
where
we're
not
using
the
fuse,
multiply,
add
on
a
gpu
or
we
have
mixed
precision
or
we're
not
really
using
tensor
cores
for
machine
learning
applications
or
finally,
we
may
be
running
integer
heavy
code
codes
where
we're
not
really
thinking
about
how
many
flops
we're
doing
we're
thinking
about
how
many
instructions
per
second
we're
doing
in
the
traditional
roof
line.
Arithmetic
intensity
is
flops
per
byte.
A
If
we
have
no
flops,
we
have
an
arithmetic
intensity
of
zero
which,
on
a
log,
log
scale
is
unplotable,
so
in
that
case
we
really
want
to
switch
to
some
thing
like
instructions
per
byte
or
instructions
for
transaction.
So
this
has
given
rise
to
a
number
of
research
activities
that
have
been
codified
into
papers
and
methodologies.
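A sketch of that substitution; the counter values are illustrative, and the 32-byte transaction size is an assumption about the memory subsystem:

```python
# For a kernel with no floating-point work, replace flops in the
# numerator with instructions retired (hypothetical counter values).
instructions = 4.0e9
dram_bytes = 8.0e9
instr_per_byte = instructions / dram_bytes              # 0.5 instr/byte

# Or express it per memory transaction (assuming 32-byte transactions).
instr_per_transaction = instructions / (dram_bytes / 32)
```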
A
In
this
case,
we
have
a
hierarchical
roofline
model
for
gpus.
I
point
you
to
the
paper
by
charlene
from
last
year:
there's.
A
Similarly,
a
roofline
scaling,
trajectories
methodology
that
khaled
ibrahim
developed,
which
allows
us
to
understand
thread
scalability
as
well
as
potentially
gpu
thread
block
scalability,
where
you
can
actually
understand.
As
you
increase
the
amount
of
parallelism,
how
does
performance
and
data
locality
change
that
is?
A
That is, did we actually drive up performance to the point where we became memory bandwidth bound? When it comes to the instruction mix, Charlene, as well as, more recently, Torsten, Yunsung, and Yan Zhang, have been developing methodologies to allow for the analysis of tensor-core-accelerated applications; a preview of that existed in the same paper I mentioned, on the left. And then, finally, Nan Ding developed the instruction Roofline model for GPUs, which allowed her to understand performance as a function of the number of warps per memory transaction. That tells us, for integer-heavy codes, how likely it is that we are limited by instruction throughput rather than flops.
So, in summary, we can think about why we use Roofline. First, we use it to determine when we're done optimizing code. That is, if you're right at the Roofline limit, you're done with simple optimization, but you may need to motivate algorithmic changes; that is, if you're at the Roofline in the bandwidth-bound regime.
A
It
allows
you
to
understand
why
cpu's
gpus
or
other
architectures
may
be
different
or
similarly
how
an
ampere
may
give
you
better
performance
than
a
volta,
and
thus
it
allows
you
to
predict
that
performance
on
future
machines,
so
takeaway
roofline
allows
you
to
understand
that
application
relative
to
the
machine
capability,
and
it
really
is
useful
for
helping
frame
the
conversation
between
application
developers
who
know
the
application
well,
computer
scientists
who
may
be
very
good
at
performance,
optimization,
applied
mathematicians
who
understand
the
understanding
underlying
algorithms
and
the
processor
vendors,
who
may
be
extremely
knowledgeable
about
the
mic
or
architecture
of
the
target
architecture,
but
may
not
understand
the
applications
as
well
as
other
groups
here,
and
thus
it
provides
this
common
mental
model
and
common
language
for
discussing
optimization.
B
Yeah, thank you, Sam. That was a very good presentation; I think both our panelists and our audience will learn a lot. I think the Roofline model is a must-have tool for HPC developers. So maybe we have just one question from the audience, or, if that's okay for you, you can also stay with us for several minutes in the chat.
C
Okay, so, Sam, thanks. I'd like to ask actually two questions. The first is: can you give some examples that have used the Roofline model from start to end, mainly relying on the data shown in your Roofline model outputs? And the second question: are there any alternatives? In the sense that I don't think this is a silver bullet, and if we go back to when you first published it with Waterman and Patterson, you might already have looked into the alternatives.
A
Right
so,
in
most
cases,
we've
actually
started
with
the
applications.
After
the
fact
that
is,
the
applications
in
many
cases
have
existed
for
years,
if
not
decades,
and
thus
we're
applying
roofline
to
that
existing
application.
A
There
are
a
few
cases
where
root
plane
has
been
very
useful
for
applied
mathematicians
to
think
about
where
their
performance
actually
lies
relative
to
the
roofline
for
a
given
discretization
of
a
pde.
When
you
end
up
in
that
regime,
you
may
see
that
now
and
forever,
your
existing
methods
will
be
bandwidth
limited
and
the
only
way
you're
going
to
get
better
performance
on
those
is
to
increase
memory
bandwidth.
Unfortunately,
in
that
case,
bandwidth
increases
very
slowly
relative
to
compute,
and
thus
they
are
in
a
bit
of
a
quandary.
A
They
rectify
that
by
reinventing
the
algorithm,
they
change
the
discretization
of
their
pde
so
that
they
actually
move
less
data
while
performing
more
total
flops.
The
result
is,
they
can
get
both
get
better
performance
and
they
do
less
work
because
they're,
basically
exploiting
the
excess
compute
capability
on
a
cpu
or
gpu.