From YouTube: 5. Profiling and Debugging
Description
Part of An Introduction to Programming with SYCL on Perlmutter and Beyond, March 1, 2022. Slides and more details are at https://www.nersc.gov/users/training/events/an-introduction-to-programming-with-sycl-on-perlmutter-and-beyond-march2022/
Okay, so, learning objectives: learn some tips to help debug SYCL, learn how to profile SYCL code for the CUDA backend, learn about coalesced global memory access, and learn some optimization tips.
So, in SYCL, errors are handled by throwing exceptions. It is crucial that these errors are handled; otherwise your application could fail in unpredictable ways.
In SYCL there are two kinds of error: synchronous errors and asynchronous errors. Asynchronous errors will typically only materialize when you call wait on a queue or an event.
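As a minimal sketch of handling both kinds (the handler lambda and the empty kernel are illustrative, not from the slides):

    #include <sycl/sycl.hpp>
    #include <iostream>

    int main() {
      // Asynchronous errors are collected and delivered to this handler
      // when the queue (or an event) is waited on, not at the failure point.
      auto on_async_error = [](sycl::exception_list errors) {
        for (const std::exception_ptr& e : errors) {
          try { std::rethrow_exception(e); }
          catch (const sycl::exception& ex) {
            std::cerr << "Async SYCL exception: " << ex.what() << '\n';
          }
        }
      };

      sycl::queue q{on_async_error};

      try {
        q.single_task([]() { /* kernel work */ });
        q.wait_and_throw();  // asynchronous errors surface here
      } catch (const sycl::exception& ex) {
        // Synchronous errors are thrown directly at the call site.
        std::cerr << "Sync SYCL exception: " << ex.what() << '\n';
      }
    }

If you never wait (or the queue has no handler), asynchronous errors can go unnoticed until the program misbehaves, which is exactly the unpredictable failure mentioned above.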
So, if you're using default-constructed queues, you can use SYCL_DEVICE_FILTER to run the code on the host. This is a really important debugging strategy. If it's not working on the host, it's definitely not going to work on the device; but if it is working on the host, it might still not work on the device.
But you need to tick this box first, to make sure that you don't have any problems on the host. You can also set the queue's thread-pool size to one, which makes execution for the queue completely serial; that's not necessarily the default, but it's worth knowing.
So normal tools, normal C++ tools like GDB and Valgrind, can be used with normal SYCL code, and in-kernel printfs are a great way to debug code on the device.
Once everything is working on the host, and you've run GDB and Valgrind and everything's fine, then if things still aren't working on the device, printfs are the way to go.
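As a sketch, portable in-kernel output can be done with sycl::stream (DPC++ also offers an experimental printf extension); this assumes a queue q is already in scope:

    q.submit([&](sycl::handler& cgh) {
      // Total buffer size and per-statement size for the stream.
      sycl::stream out(8192, 256, cgh);
      cgh.parallel_for(sycl::range<1>(16), [=](sycl::id<1> i) {
        out << "work-item " << i[0] << '\n';  // flushed when the kernel completes
      });
    }).wait();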
So here we have SYCL_DEVICE_FILTER set to host, and then a single queue thread, the thread-pool size equal to one, and then we just run GDB; and we can do the same with Valgrind.
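The invocations would look something like this (the environment variable names follow the DPC++ documentation of the time; ./myapp is a placeholder binary):

    SYCL_DEVICE_FILTER=host SYCL_QUEUE_THREAD_POOL_SIZE=1 gdb ./myapp
    SYCL_DEVICE_FILTER=host SYCL_QUEUE_THREAD_POOL_SIZE=1 valgrind ./myapp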
That kind of thing. Okay, so yeah, these are very basic debugging strategies, but very useful. Okay: optimization.
So one of the most important things, one of the things that will dictate performance significantly, is how you access global memory. Memory access patterns can significantly affect performance, and this is especially important when reading or writing global memory. That's the main thing here: global memory. If you're accessing, say, local memory (shared memory, as it's called in CUDA), it's not as important, but for global memory it certainly is.
Essentially, you want your work items to be accessing adjacent bits of memory: you want this work item to access the element next to the one that work item accesses, and so on. Because essentially what the memory manager is going to do is load a particular line of this memory, and it's not going to discriminate. Say I only wanted this element, that one and that one: it's not going to fetch just those.
There's a default cache-line size, and every time we ask for memory, the hardware will fetch that amount, so we want to make sure that we're using as much of it as possible. So this case is really good: coalesced global memory access.
Yes, that's 100% global access utilization. This case, by contrast, is quite bad: if we're accessing every second element of memory, then half of each load has gone to waste, and we see that these work items have nothing to do. In the worst case, some work items might actually be waiting on others to get their data.
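A sketch of the two patterns, assuming a queue q, USM pointers in and out, and a size n are in scope (all illustrative names):

    // Coalesced: consecutive work-items read consecutive elements,
    // so every byte of each fetched cache line is used.
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
      out[i] = 2.0f * in[i];
    });

    // Strided: consecutive work-items are two elements apart, so half
    // of every cache line fetched from global memory is wasted.
    q.parallel_for(sycl::range<1>(n / 2), [=](sycl::id<1> i) {
      out[2 * i] = 2.0f * in[2 * i];
    });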
Okay, a word of caution here: index flipping. SYCL ranges are row major. That essentially means that, for some two- or three-dimensional range, the work item with SYCL id (i, j) is a neighbor of (i, j+1), as it is in most C and C++ APIs.
In CUDA, by contrast, threads (i, j) and (i+1, j) are neighbors: CUDA organizes its threads in a column-major format. So we need to make sure we're aware of this, and a good rule of thumb is not to calculate a linear index manually; it's better to use the member functions get_local_linear_id and get_global_linear_id.
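A sketch of that rule of thumb, assuming a queue q, pointers in and out, and dimensions rows and cols divisible by 16 (all illustrative):

    q.parallel_for(sycl::nd_range<2>{{rows, cols}, {16, 16}},
                   [=](sycl::nd_item<2> item) {
      // Risky: a hand-rolled index bakes in a layout assumption, e.g.
      //   size_t idx = item.get_global_id(0) * cols + item.get_global_id(1);

      // Preferred: let SYCL produce the layout-correct linear index.
      size_t idx = item.get_global_linear_id();
      out[idx] = in[idx];
    });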
So what happens if we do it manually and we use row-major data with column-major memory access? We're still using all of the elements, but essentially we're treating threads as adjacent that are actually organized in a column-major way.
We're kind of messing up our memory access patterns. On the A100, potentially, this isn't actually a problem: maybe the latest NVIDIA hardware is able to cope with this, as with the previous example where things were just flipped the other way around. But I would say you'd still suffer some performance loss with this particular access pattern.
By contrast, if you know that you're using row-major data and also a row-major work-item layout, then you can access these very nicely, and this is guaranteed to be optimal. The other layout might also happen to be optimal, but this one is guaranteed to be the best memory access pattern you can get. So again, to avoid all this, do not calculate the linear id yourself: use the member functions.
Okay, a few very quick points; obviously you could talk for a long time about optimization strategies, but very quickly. Different problems are optimal for different work-group sizes, so you should test them, benchmark, see which is the best and stick with that. Minimize memory transfers; this kind of goes without saying. Memory transfers take time, going from host to device, device to host, and so on. In general, prefer malloc_device over malloc_shared.
That's true if there isn't any physically shared memory. If there is physically shared memory, then malloc_shared will be, you know, amazing. For instance, if you're running on an Intel CPU with unified graphics, then you have physically shared memory, and if you're using malloc_shared there, it's really great. But in general, say in CUDA:
shared allocation is done with cudaMallocManaged, which relies on page faults to move values here and there, and that's slower than explicitly moving data around.
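A sketch of the explicit-movement pattern with malloc_device (the size, names and kernel are illustrative):

    #include <sycl/sycl.hpp>
    #include <vector>

    int main() {
      sycl::queue q;
      const size_t n = 1024;
      std::vector<float> host(n, 1.0f);

      float* d = sycl::malloc_device<float>(n, q);          // device-resident memory
      q.memcpy(d, host.data(), n * sizeof(float)).wait();   // explicit copy to device
      q.parallel_for(sycl::range<1>(n),
                     [=](sycl::id<1> i) { d[i] += 1.0f; }).wait();
      q.memcpy(host.data(), d, n * sizeof(float)).wait();   // explicit copy back
      sycl::free(d, q);
    }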
So that's a very, very easy optimization. Inline functions: if you're calling a function from within a kernel, then inline it and see if you get a performance gain. Recently on our team, Teddy was running a benchmark and he managed to get a 30% speed-up just by inlining a list of functions, so it can really give a good performance gain. Use local memory where possible: if the algorithm lends itself to local memory, then definitely use it as opposed to global memory; see the sketch below.
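A sketch of staging data through local memory, assuming a queue q, a pointer in, and a size n divisible by 256 are in scope (local_accessor is the SYCL 2020 spelling; older toolchains used an accessor with target::local):

    q.submit([&](sycl::handler& cgh) {
      // One tile of scratch space per work-group, in fast on-chip memory.
      sycl::local_accessor<float, 1> tile(sycl::range<1>(256), cgh);
      cgh.parallel_for(sycl::nd_range<1>{n, 256}, [=](sycl::nd_item<1> it) {
        size_t l = it.get_local_id(0);
        tile[l] = in[it.get_global_id(0)];   // stage once from global memory
        sycl::group_barrier(it.get_group());
        // ... operate on tile[] instead of re-reading global memory ...
      });
    });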
Keep work groups converged where possible. Make sure that there isn't too much divergence in the control flow within a particular work group. Newer hardware does better with this, and things like independent forward progress of work items are possible on newer CUDA devices, but a rule of thumb is to try to keep your work groups as aligned as possible in their execution.
Also a very easy thing: use the sycl::native namespace, for example sycl::native::sin, if the native accuracy is tolerable. Once you use the sycl::native functions, you're just relying on the precision of the native hardware functions, and if that's okay, then great. The functions in the plain sycl namespace have certain precision guarantees, but perhaps they're not necessary; the native functions usually have less precision, but not always.
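For example (x is a placeholder value):

    float a = sycl::sin(x);          // precision guaranteed by the SYCL spec
    float b = sycl::native::sin(x);  // faster; precision is implementation-defined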
Okay, so: profiling. The standard NVIDIA tools are still available when you're profiling.
Kernel names have to be unique: you can't have this parallel_for using one name and then that parallel_for using the same name, or that single_task having the same name. Naming kernels can be useful because the output that you get from a profiler is usually verbose and has a lot of words in it, and if you can spot "aha, my reduce kernel", it makes things easier.
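A sketch of naming a kernel, assuming this sits inside a q.submit with command-group handler cgh (the class name is arbitrary):

    class reduce_kernel;  // declaration only; used purely as a kernel name

    cgh.parallel_for<reduce_kernel>(sycl::range<1>(n), [=](sycl::id<1> i) {
      // ... kernel body ...
    });

The name then shows up (mangled, but recognizable) in profiler output.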
Okay, so nsys, again part of the NVIDIA toolkit's profiling tools. This can be used for tracing and also for timings: a very simple tracing and timing output, similar to nvprof if you've used that before. (On the slide the space is missing; the command should be "nsys profile", with a space after "nsys".) And then you'll get this kind of output.
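An invocation looks something like this (the --stats flag prints the summary tables; ./myapp is a placeholder binary):

    nsys profile --stats=true ./myapp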
This gives you good kernel statistics. The kernel submission will give you a somewhat long-winded name, but that's okay, and it'll give you the timings, including the timings of memcpys and so on. So we can see that the memcpys actually take a pretty significant share of our total operations here, which is sometimes the case, though not always.
ncu can also be used for detailed kernel analysis. If you want to measure things like occupancy, you can just use it in the usual way: run your ncu command like this and you get all this output, with things like block size and grid size.
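Something along these lines (./myapp is a placeholder binary; by default ncu reports sections that include occupancy):

    ncu ./myapp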
There should be things about theoretical occupancy and achieved occupancy. This one is not very good: an achieved occupancy of 6.11 against a theoretical 50, which is not that good. But this is, I think, a bad example that I tried to...