From YouTube: Understanding AMD GPU ISA
Description
Rene van Oostrum from AMD presents a talk on Understanding AMD GPU ISA. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/ Session Chair: Oisín Creaner
Rene van Oostrum: Well, I've found that it helps to get a better mental model of how our hardware operates, and also that, while working on performance optimization, seeing and understanding what the compiler does helps in making optimization decisions. The slide deck of this talk will be available for download, and it has a couple of extra slides that I won't show today that give pointers to resources such as our public ISA reference manual and the application binary interface, in case you want to dive deeper.
There is more material here than I'll be able to cover today. With that said, let's dive right in. Consider the following HIP kernel, which is, coincidentally, also a valid CUDA kernel. It computes y = alpha * x + y, where x and y are vectors and alpha is a scalar. As simple as this kernel is, it contains elements that are common to nearly all kernels.
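
The kernel itself isn't captured in the transcript; a minimal sketch of the kind of kernel being described (the name daxpy and the exact signature are assumptions) might look like this:

    // y = alpha * x + y, one element per thread
    __global__ void daxpy(int n, double alpha, const double* x, double* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index of this thread
        if (i < n)                                      // guard against out-of-bounds lanes
            y[i] = alpha * x[i] + y[i];
    }

The argument order here (a four-byte int first, then the eight-byte alpha and pointers) matches the kernel-argument layout discussed later in the talk.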
Scalar registers are registers of which there is a single instance per wavefront, visible to all threads. Registers are 32 bits wide, but we can combine consecutive registers if we need 64 bits. To operate on scalar registers, we have scalar instructions. If we look at the operands: typically, the target register or memory address comes first, followed by one or more source operands, and load and store instructions may have an offset too.
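
As an illustration of that operand order, a scalar load in GCN assembly looks roughly like this (register numbers and offset are placeholders):

    s_load_dwordx2 s[0:1], s[6:7], 0x8   ; destination pair, source address pair, byte offset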
At kernel dispatch, s[6:7] is set to point to the region of device memory that holds the kernel arguments. Then scalar register s8 holds blockIdx.x, which is the same for all threads in the wavefront, and vector register v0 holds threadIdx.x, which is unique to each thread in the wavefront. So here we see how scalar registers are used for values that are common to all threads in the wavefront.
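
In other words, the register state at kernel entry, as described here, is roughly (a sketch, not the full ABI):

    ; s[6:7] = pointer to the kernel-argument region in device memory
    ; s8     = blockIdx.x   (one copy per wavefront, shared by all lanes)
    ; v0     = threadIdx.x  (one copy per lane)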
A
Let's
analyze,
the
iso
line
by
line
s67
in
the
instruction
on
line
2
holds
a
pointer
to
the
kernel
argument,
region
and
device
memory
on
the
bottom
right.
We
see
what
it
looks
like.
The
first
argument
is
a
four
byte
end.
Then
we
have
four
bytes
of
padding
and
the
remaining
arguments
are
all
eight
bytes.
The
instruction
in
line
two
loads,
the
value
at
offset
zero,
so
that
is
n
and
it
stores
the
value
of
n
in
scalar
register
as
zero.
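
Putting that together, the argument layout and the load on line 2 would look something like this (offsets follow the description above; the exact mnemonic is an assumption):

    ; kernel-argument region:
    ;   0x00  int      n       (4 bytes, followed by 4 bytes of padding)
    ;   0x08  double   alpha
    ;   0x10  double*  x
    ;   0x18  double*  y
    s_load_dword s0, s[6:7], 0x0   ; line 2: s0 = n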
Line 3 loads blockDim.x, but this field in the dispatch packet is a two-byte value, and since we don't have instructions to load just two bytes, we are loading both workgroup_size_x and workgroup_size_y and storing them in s1. We'll deal with the unneeded workgroup_size_y in a moment. After executing the first two ISA instructions in lines 2 and 3, we have the value of n in scalar register s0, and s1 holds both blockDim.x and blockDim.y, where the latter is undefined. Or at least, that is what s0 and s1 will hold once the loads have completed.
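
A sketch of what line 3 might look like (the register holding the dispatch-packet pointer and the field offset are assumptions; workgroup_size_x and workgroup_size_y are adjacent 16-bit fields in the dispatch packet, so one dword load fetches both):

    s_load_dword s1, s[4:5], 0x4   ; line 3: low word = workgroup_size_x, high word = workgroup_size_y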
A
It
is
the
job
of
the
compiler
or
the
assembly
programmer
to
make
sure
that
data
dependencies
are
satisfied,
and
that
is
where
the
scalar
instruction
s
weight
count
comes
in.
The
idea
is
as
follows:
whenever
a
memory
instruction
is
issued,
the
counter
is
incremented
and
there
are
separate
encounters
for
scalar
loads
and
loads
and
stores
to
lds
on
the
one
hand
and
vector
loads
and
stores,
on
the
other
hand,
and
lds
stands
for
local
data
store.
This
is
what
encuda
is
called
shared
memory.
There is a bit more to the counter story, but what I'm showing here suffices for understanding our ISA dump. In any case, the counters are incremented when memory instructions are issued and decremented when the referenced data arrives at its destination. The s_waitcnt instruction halts execution of the wavefront until the counter has reached a specified value. For LDS loads and stores and vector loads and stores, instructions complete in the order in which they are issued, but for scalar loads this is not the case: these may complete out of order, so for scalar loads we typically have to wait until the counter reaches zero.
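
That is presumably what the next line of the dump does; a typical form (the counter value shown is the usual one for scalar loads, but it is an assumption for this particular dump):

    s_waitcnt lgkmcnt(0)   ; stall until all outstanding scalar/LDS accesses have completed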
A
Once
the
data
from
the
scalar
loads
has
arrived
in
s0
and
s1,
we
can
deal
with
the
possible
garbage
value
and
the
high
order
word
of
s1.
We
take
care
of
this
with
a
scalar,
logical
and
operation,
with
a
bit
mass
that
has
all
ones
and
the
two
lower
order
bytes.
So
the
two
lower
order,
bytes
of
s1
containing
block
dimx,
are
retained
and
the
two
higher
order
byte
are
reset
to
zero.
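
In GCN assembly, that masking step would look something like this (a sketch; by the surrounding description it is presumably line 5 of the dump):

    s_and_b32 s1, s1, 0xffff   ; keep blockDim.x in the low word, clear the high word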
A
In
line
six,
we
multiply
s8,
which
was
initialized
to
hold
block
idx
with
s1,
and
we
just
loaded
block
dmx
and
we
reuse
s8
to
store
the
result
of
the
multiplication
line.
Seven
is
our
first
vector
instruction:
it
adds
v0,
which
is
the
threat
idx
value
that
is
unique
to
all
threats
to
the
value
in
s8
so
effectively.
This
completes
the
computation
of
I
in
the
first
line
of
the
hip
source
code
that
is
shown
at
the
bottom
right
of
this
slide.
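
A plausible reconstruction of those two lines (exact mnemonics vary between ISA generations):

    s_mul_i32 s8, s8, s1    ; line 6: s8 = blockIdx.x * blockDim.x
    v_add_u32 v0, s8, v0    ; line 7: v0 = s8 + threadIdx.x = i, per lane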
At this point, v0 holds, for each thread, the index i of the arrays x and y that it is to operate on. But we may have more threads than array elements, so we should disable any further computation for threads that would access the arrays out of bounds, and that is what the last three lines of the first assembly block take care of. Line 8 compares s0, which holds n, with v0, which now holds i.
Let's summarize what we have done so far. Line 2 loads the value of n into scalar register s0. Lines 3 through 7 compute the index i for each thread and store it in vector register v0. Lines 8 and 9 set the bit in the execution mask to one for threads that should remain active, and line 10 jumps to the end of the program if there are no active threads.
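
A sketch of what those last three lines typically look like in compiler output (the label name is a placeholder):

    v_cmp_gt_i32       vcc, s0, v0       ; line 8: per-lane test n > i
    s_and_saveexec_b64 s[0:1], vcc       ; line 9: deactivate out-of-bounds lanes
    s_cbranch_execz    END_OF_KERNEL     ; line 10: skip everything if no lane is active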
Next, let's have a look at the second half of the assembly dump. This block is only executed by the active threads. Lines 12 and 13 use the pointer to the kernel arguments again; this pointer is stored in s[6:7]. I'm showing the kernel arguments with their offsets in the top left of this slide. Line 12 loads 4 dwords, so two 8-byte values, starting at offset 8, so it reads alpha and x in one instruction, where alpha ends up in s[0:1] and x in s[2:3].
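
Something like this (the destination registers for line 13, the load of the y pointer, are an assumption):

    s_load_dwordx4 s[0:3], s[6:7], 0x8    ; line 12: alpha -> s[0:1], pointer to x -> s[2:3]
    s_load_dwordx2 s[4:5], s[6:7], 0x18   ; line 13: pointer to y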
Each thread has i stored in v0, a 32-bit register, but the pointers are 64 bits wide. Lines 14 and 15 first turn the 32-bit value of i into a 64-bit offset from the start of the arrays. Line 14 does a 31-bit right shift with sign extension of the value in v0 and stores the result in v1, so v1 ends up with all zeros if the sign bit of v0 is 0, or all ones otherwise. In other words, v[0:1] is now the 64-bit representation of the value of v0.
A
Note
that
higher
order
bits
are
in
the
higher
numbered
register
of
the
pair
line.
15
does
a
left
shift
by
3
positions
of
the
64-bit
value
and
v01,
which
comes
down
to
doing
a
multiplication
by
8.,
since
the
values
in
the
array
x
and
y
are
doubles.
The
offset
of
the
I
double
from
the
beginning
of
the
array
in
bytes
is
8
times
I,
and
hence
the
multiplication.
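
Lines 14 and 15 might read as follows (note the reversed operand order of these shift instructions: the shift amount comes first):

    v_ashrrev_i32 v1, 31, v0         ; line 14: v1 = sign bits of i
    v_lshlrev_b64 v[0:1], 3, v[0:1]  ; line 15: byte offset = i * sizeof(double) = i * 8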
Lines 17 through 19 add the thread's offset in v[0:1] to the pointer to the first element of x in s[2:3]. The result, a pointer to x[i] for each thread, is stored in v[2:3]. The AMD GPU ISA does not have separate instructions for 64-bit integer addition; instead, the 64-bit addition is executed as two consecutive 32-bit additions with carry-out and carry-in.
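
A sketch of that three-instruction sequence (register choices follow the description; exact mnemonics are generation-dependent):

    v_mov_b32     v3, s3                ; line 17: high half of the x pointer into a VGPR
    v_add_co_u32  v2, vcc, s2, v0       ; line 18: add low 32 bits, carry-out in vcc
    v_addc_co_u32 v3, vcc, v3, v1, vcc  ; line 19: add high 32 bits plus the carry-in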
Since the pointer to y[i] is still needed to write the result back in line 27, line 25 waits until the values read in the previous two lines have arrived from device memory in their target registers and are ready for subsequent use. Line 26 does the core computation of the kernel: it multiplies x[i] in v[2:3] with alpha in s[0:1], adds y[i] in v[4:5], and stores the result, alpha times x[i] plus y[i], in v[2:3].
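
Lines 25 and 26 would then be, roughly:

    s_waitcnt vmcnt(0)                        ; line 25: wait for x[i] and y[i] to land
    v_fma_f64 v[2:3], s[0:1], v[2:3], v[4:5]  ; line 26: v[2:3] = alpha * x[i] + y[i]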
This is my final slide, showing a high-level summary of the second half of our kernel, but since my time is about up, I'll just leave it up here while we open the floor for questions.
Oisín Creaner: Thank you for that. Those were some very in-depth and fascinating topics there. I'm just looking to see if there are some hands up amongst the audience.
I've just allowed you to unmute your mic. Sorry, it's just that if you unmute yourself, you can ask the question.
B
It
it
suck
it's
the
name,
I'm
seeing
there
you've
just
been.
You
raise
your
hand.
Audience member: [question inaudible]

Rene van Oostrum: Oh, I see what you mean. So in this kernel we don't need them; in general, you could use them, for instance, in reductions.
Oisín Creaner: Good, we have another question here, and this is: how does the differing wavefront size in AMD GPUs versus warp size affect applications? Do users need to think about this explicitly?
Rene van Oostrum: It depends. So if you're porting between the two architectures, say you're going from CUDA to HIP, and you have a code where the warp size is hard-coded as 32, then you'll need to revisit that, since the wavefront size on our GPUs is 64.
Oisín Creaner: Very good. Another question here, wondering what would you say is the core take-home message from the walkthrough here? What is the one thing that maybe a less experienced developer should take from this?
Rene van Oostrum: Well, one thing is that if you look at the HIP source for this kernel, it's basically an FMA that is executed by many threads. And then, if you look at the ISA dump that I'm showing here, you see that the FMA instruction is in line 26, and all the rest is scaffolding.
Admittedly, you shouldn't figure this out by staring at assembly in the first place; as Max already mentioned, you should always use profilers to determine where your time is spent. But this kind of ISA dump can be quite revealing of what is going on under the hood, and it has helped me more than once.