From YouTube: 4. ND Range Kernels
Description
Part of "An Introduction to Programming with SYCL on Perlmutter and Beyond", March 1, 2022. Slides and more details are at https://www.nersc.gov/users/training/events/an-introduction-to-programming-with-sycl-on-perlmutter-and-beyond-march2022/
Okay, so this section is about ND-range kernels and, more specifically, the more advanced features like using local memory. There's a bit of duplication with the previous section, so I'll skip past that.
We want to learn about the SYCL execution and memory model, how to enqueue an ND-range kernel, the ND-range kernel functions, and how to use local memory. So again, the fundamental unit is the work-item; this is taken from a slide in the previous section. Work-items are organized into work-groups, and work-groups are organized into ND-ranges.
So this is a 16-by-two ND-range, and the local range here is, I suppose, four by one.
Now take a particular work-item in this ND-range. There's a global range, which is (12, 12), and each work-item has a global ID. This work-item has global ID (6, 5): counting zero, one, two, three, four, five, six in one dimension, and zero, one, two, three, four, five in the other. The group range is the number of groups, or the number of blocks if you're used to dealing with CUDA; here it's (3, 3). For the group ID, sorry, I should be counting to the right and then down: so it's (2, 1). We'll talk more about this on a later slide, but the way you index into these ranges matters, and it's actually the opposite of what it is in CUDA, which is important to know.
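As a rough sketch of what that means, assuming the row-major linearization defined in the SYCL spec: in SYCL the right-most index varies fastest, whereas in CUDA the x component is listed first and is the fastest-varying one.

```cpp
// SYCL: for an item with id (i0, i1) in a range (r0, r1),
// the right-most dimension varies fastest:
//   size_t linear = i1 + i0 * r1;
// CUDA goes the other way: threadIdx.x (the first component) varies fastest:
//   unsigned linear = threadIdx.x + threadIdx.y * blockDim.x;
```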
So, the SYCL execution model: typically an ND-range invocation in SYCL will execute the SYCL kernel function on a very large number of work-items, often in the thousands. This allows us to achieve good occupancy, that is, good use of the compute units on a GPU or some other offload device.
Sorry, this is a bit of duplication from the previous section: we can synchronize across a work-group using a barrier. This is especially important if we're dealing with local memory, which we'll detail in a bit. So again, private memory: each work-item has memory that is completely private to it, like a register. We also have local memory, which is shared by a work-group.
So this work-item can write to local memory, which can then be read by another work-item. If that is to happen, we need to use a barrier to make sure the write has indeed finished before the read happens. And, very importantly, a work-item in one work-group cannot access the local memory of another work-group; it can only access its own work-group's local memory. And then, again, there is global and constant memory.
So private memory is very, very fast. Local memory is pretty fast: theoretically somewhere between a few cycles and maybe 10 cycles to access, whereas a private-memory access is theoretically maybe one or two cycles, and global and constant memory are in the hundreds of cycles, so a lot slower. So if we can use local memory for, say, intermediate computation, we really should, whereas global memory is of course still necessary for the initial reads and the final writes of values. Global memory is larger than local memory, and local memory is larger than private memory. So this is the usual speed-versus-size hierarchy: private memory is small and very fast, local memory is relatively small and relatively fast, and global memory is large and slow.
This time we're interested in constructing a parallel_for with an nd_range. We're only going to deal with one-dimensional ND-ranges in this workshop. An nd_range is made up of the global range, which in this case is 1024, and the local range, the size of your work-group. The local range must divide the global range.
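As a minimal sketch of what that looks like (the names global_size and local_size are placeholders, not taken from the slides):

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  constexpr size_t global_size = 1024;  // total number of work-items
  constexpr size_t local_size  = 128;   // work-group size; must divide global_size

  q.submit([&](sycl::handler &cgh) {
    cgh.parallel_for(
        sycl::nd_range<1>{sycl::range<1>{global_size}, sycl::range<1>{local_size}},
        [=](sycl::nd_item<1> item) {
          size_t gid = item.get_global_linear_id();  // this work-item's global index
          (void)gid;  // the kernel body would use gid here
        });
  });
  q.wait();
}
```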
Then we can use our nd_item to get things like the global linear ID. We can also get things like the group ID, we can get the local ID, and we can do a barrier; we can do a lot of things with our nd_item.
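For reference, a sketch of those nd_item queries inside an ND-range kernel body (the fence-space argument shown is one common choice, not prescribed by the slides):

```cpp
// Inside a parallel_for taking sycl::nd_item<1> item:
size_t global_id = item.get_global_linear_id();  // index across the whole ND-range
size_t local_id  = item.get_local_linear_id();   // index within this work-group
size_t group_id  = item.get_group_linear_id();   // index of the work-group itself
item.barrier(sycl::access::fence_space::local_space);  // synchronize the work-group
```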
Okay, so on most hardware, global_range[i] must be a multiple of local_range[i]. So this first one is not okay: we won't be able to construct an nd_range with it, because 64 does not divide 1000, so this will give us an error. This one is okay; it divides.
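A small sketch of the divisibility rule with the sizes mentioned here:

```cpp
// 64 does not divide 1000, so this nd_range is invalid and will raise an error:
// sycl::nd_range<1> bad{sycl::range<1>{1000}, sycl::range<1>{64}};

// 64 divides 1024, so this one is fine:
sycl::nd_range<1> good{sycl::range<1>{1024}, sycl::range<1>{64}};
```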
Okay
yeah.
So
for
nvidia
hardware,
work,
group,
sizes
or
block
sizes
are
best
chosen
from
well,
usually
not
8
or
16,
but
usually
32,
64,
128,
256,
512
1024.
Okay, so using local memory. This is a really important thing if you want to write performant code. In SYCL it's called local memory; sorry, SYCL local memory is called shared memory in CUDA, as we've mentioned already. To use local memory, you must use an accessor. We've glossed over the buffer/accessor model, but for local memory, and also for things like constant memory and texture memory, you need to use accessors. The way we say that it's local memory is that we construct an accessor with sycl::access::target::local.
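A sketch of constructing such an accessor inside a command group; local_mem_size is a placeholder element count, and float stands in for the element type T on the slide. The exact spelling depends on the SYCL version: the talk describes the older access::target::local form, and SYCL 2020 provides sycl::local_accessor as the equivalent.

```cpp
q.submit([&](sycl::handler &cgh) {
  constexpr size_t local_mem_size = 128;  // placeholder: local elements per work-group

  // Older (SYCL 1.2.1-style) spelling, matching the access::target::local form described here:
  sycl::accessor<float, 1, sycl::access::mode::read_write, sycl::access::target::local>
      local_mem{sycl::range<1>{local_mem_size}, cgh};

  // SYCL 2020 equivalent:
  // sycl::local_accessor<float, 1> local_mem{sycl::range<1>{local_mem_size}, cgh};

  cgh.parallel_for(
      sycl::nd_range<1>{sycl::range<1>{1024}, sycl::range<1>{local_mem_size}},
      [=](sycl::nd_item<1> item) {
        // Use the accessor like a pointer / array from within the kernel.
        local_mem[item.get_local_linear_id()] = 0.0f;
      });
});
```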
We didn't really have to worry about this q.submit beforehand; we were just dealing with q.single_task and q.parallel_for. But when you need to dictate the way memory is managed within your kernel, the memory that needs to be used within the kernel, then essentially you need to define this memory and how it relates to outside memory, or not. In this case it is purely contained within the q.submit.
This local memory can't really go anywhere, but you need to do this within a submit function. You can't just do it within, say, a parallel_for, because the memory needs to be set up beforehand. This is similar to CUDA: if you have a dynamic shared memory size, you need to configure that when you're launching the actual kernel. So you're constructing an accessor which essentially just says "give me a chunk of memory", and you can access the accessor like a pointer, just like a normal pointer. Well, that's because this one is one-dimensional; if it were two-dimensional you'd need to do a bit more, but one-dimensional local memory is probably the way to go. So this is local memory of size local_mem_size, and it has type T, and then it can be used within my kernel.
So this is something to keep in mind: once you do a q.submit, every subsequent operation needs to go through the handler for that command group. So this is the command group: it's a command plus some memory operations.
And this q.submit can only contain one command, a command being a parallel_for, a single_task, a memcpy and so on. You can only do one of these within a q.submit. So we use a submit when we need to do stuff with memory; but if we don't need to do anything with memory, with accessors for instance, then it's easier to use q.memcpy, q.parallel_for and so on, just because it's less bookkeeping. But yeah: one command within each submit.
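A sketch contrasting the two forms; n, local_size, in, out, dst and src are placeholders (USM pointers and sizes), and the shortcut member functions are the ones SYCL 2020 defines directly on sycl::queue:

```cpp
// Shortcut form: fine when no accessors or local memory are needed.
q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { out[i] = in[i]; });
q.memcpy(dst, src, n * sizeof(float));

// Explicit submit: needed when the command group has to set up memory
// (for example a local accessor) before launching its single command.
q.submit([&](sycl::handler &cgh) {
  sycl::local_accessor<float, 1> scratch{sycl::range<1>{local_size}, cgh};
  cgh.parallel_for(sycl::nd_range<1>{sycl::range<1>{n}, sycl::range<1>{local_size}},
                   [=](sycl::nd_item<1> item) {
                     scratch[item.get_local_linear_id()] = in[item.get_global_linear_id()];
                   });
});
```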
Okay, so again, we're saying the access target is local, and then you can treat accessors like pointers from within kernels, so you use this operator[]. So local_mem is declared here with a particular size, and cgh, the command group handler, needs to be passed in as well. Then once we're inside our kernel, you can just use it as a normal pointer; that's fine.
Okay, so now for an ND-range: when you submit a parallel_for, you pass in an nd_item. You don't necessarily need to take an nd_item, but it's useful. If you do, the nd_item has some really useful member functions, like get_global_linear_id(), which gets your global linear ID as if you were indexing into a one-dimensional array, and get_local_linear_id().
The local linear ID is just your linear ID within your work-group, and there's also the group ID; there are lots of different queries. Let's also imagine that we're writing some value to local memory at our local ID, and then we want to use this value at some other point: then we're going to use item.barrier(). This is a member function of the nd_item, and it's really important if you're using shared memory. Usually, every time you write to local memory and want it to be read somewhere else, you need to make sure you do a barrier. A barrier is also a member function of a group, so you can do it that way as well.
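The general pattern being described, as a sketch (scratch is a hypothetical local accessor, lid the local linear ID, wg_size the work-group size, some_value a placeholder):

```cpp
scratch[lid] = some_value;                              // write my slot of local memory
item.barrier(sycl::access::fence_space::local_space);   // wait for the whole work-group
float neighbour = scratch[(lid + 1) % wg_size];         // now safe to read another slot
```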
Okay: what is the motivation for organizing memory in blocks of local memory? Essentially, we're utilizing the hardware. Each block, each work-group, has access to this region of memory that's quite fast, and it can be shared between work-items. So if there are computations that can make use of shared memory, we can share data; this is the only way we can share data among work-items.
We cannot share data among work-items in separate work-groups, so it's only through this shared memory that we can have writes followed by reads from different work-items. We're going to see a simple example of how we use this to try and optimize things in the final exercise. But essentially, the hardware is there, so we want to use it. It's fast, and if you can use local memory, your algorithm, whatever it
is, can be orders of magnitude faster than if you're naively using global memory. If it's suitable, that is: not every task requires shared memory, but if yours does, then you should use it.
B: But just for clarification, what am I going to do if I have a distributed-memory machine with some GPUs? How do I actually handle the memory? Do I have to use MPI subroutines to keep transferring the data back and forth, with SYCL used only for accessing the GPUs, or how do I handle that situation?
B: Yes, so if I have a distributed machine which has, on its local nodes, let's say some accelerators, GPUs or, I don't know, FPGAs or something, how do I handle the memory and the communication? Do I need to use MPI and then just transfer the data back and forth between the host memory and the device?
E: Yes, that's a really good question. SYCL is sort of a single-node programming model, so you would generally use something like MPI to communicate between nodes and run an instance of SYCL, with DPC++, on each node.
B: So then, to compile those, I write my code. Essentially, I can write it in SYCL and just link, and also load, I guess, the MPI library, right?
D: Yeah, I was just going to say that at this exact moment we don't have a module for MPI built against the SYCL compiler.
But if you want to try that out, I have a proof-of-concept, though largely untested, build that should be able to properly use the high-speed network on Perlmutter. I'm happy to have a conversation over Slack, or however, if you want to take a look at that.
One other comment: I'm aware that there is a research project called Celerity, which is, I guess, billed as something like a SYCL for distributed-memory platforms. That's about the extent of my knowledge of it, other than that it's a research project and it's outside the scope of what we're targeting at the moment.
A: Why is it limited to 3D? This is a good question; it's just got to do with the SYCL spec. I'm not entirely sure why it was limited to three dimensions.
C: I had a thought: in the interest of time, and getting through the last section, maybe it would make sense to move on.
A: The task is to flip an array: you have an array and you reverse its direction. You could do this naively; I'm not going to write all of this out, but you could naively queue a parallel_for.
Okay, the naive way of doing it: we're just reversing the vector. The output is the pointer b, and then we would do something like, well, first we need to get some indices. Oh, this is much nicer, okay. The idea is that you just flip the array: we write the input element, from the pointer a at index global_idx, to the flipped position in the output b. That's the naive way of doing things.
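A minimal sketch of that naive version, assuming USM device pointers a (input) and b (output) of length N, following the live-coded example only loosely:

```cpp
// Naive reversal: each work-item copies one element to its mirrored position.
// One side of the copy walks global memory in reverse order.
q.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> idx) {
  size_t i = idx[0];
  b[N - 1 - i] = a[i];
}).wait();
```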
A
But
if
you
use
shared
memory,
so
essentially
you'll
be
accessing
this
in
a
reverse
way.
Okay,
so
each
work
item
will
be
accessing
instead
of
accessing,
like
in
a
where
work
item,
say,
n
or
work
item.
I
is
accessing
point
I
and
work
item
I
plus
one
is
accessing
the
they're
indexing
into
I
plus
one.
It's
going
the
opposite
way,
so
this
is
actually
probably
fine
or
is
it
no?
It
definitely
is
fine
by
modern
kind.
CUDA hardware; the memory controllers can handle indexing into things in a backwards, flipped way. But it depends on the device: on an older device it wouldn't necessarily be as optimal, so it would be beneficial to use shared memory as an in-between. So we'll just look at the solution to see.
In solution.cpp, okay: instead of just doing it the naive way, we're allocating some local memory.
We then load the local memory with the global value, then we execute a barrier, and then we write back to global memory in a completely aligned way, so we're not flipping that access, while reading back from local memory. We'll talk about this more in the upcoming section, but essentially, if you can access global memory in as uniform and contiguous a way as possible, that will give you much better performance.
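A sketch of that pattern (load local memory, barrier, write back contiguously). Here in and out are assumed USM pointers, N is assumed to be a multiple of wg_size, and the details may differ from the workshop's actual solution.cpp:

```cpp
q.submit([&](sycl::handler &cgh) {
  sycl::local_accessor<float, 1> scratch{sycl::range<1>{wg_size}, cgh};

  cgh.parallel_for(
      sycl::nd_range<1>{sycl::range<1>{N}, sycl::range<1>{wg_size}},
      [=](sycl::nd_item<1> item) {
        size_t gid        = item.get_global_linear_id();
        size_t lid        = item.get_local_linear_id();
        size_t group      = item.get_group_linear_id();
        size_t num_groups = item.get_group_range(0);

        // Contiguous read from global memory into this work-group's local memory.
        scratch[lid] = in[gid];

        // Ensure every work-item's write to local memory has finished
        // before any work-item reads another slot.
        item.barrier(sycl::access::fence_space::local_space);

        // Contiguous write back to global memory: this group's block lands in
        // the mirrored block of the output, and the element order is flipped
        // by reading local memory backwards.
        size_t out_base = (num_groups - 1 - group) * wg_size;
        out[out_base + lid] = scratch[wg_size - 1 - lid];
      });
}).wait();
```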
We'll talk about this more, and I haven't really explained exactly what I mean yet, but if we can get work-items i and i+1 accessing adjacent points in memory, that will give us better performance. In this case, in the naive version, they actually are accessing adjacent points in memory, but flipped.
So work-items i and i+1 were accessing elements j and j minus one, instead of j and j plus one. On older hardware that would have been a problem; on modern hardware it's not a problem. But it's a very simple use of local memory.
So, item.barrier: this ensures that all of the work-items in a particular work-group wait until every work-item in that work-group has reached this point, and only then do they proceed to the next step. In this instance, and this is a really, really good question, we're writing to local memory.
Okay, and then we want to read from local memory. Without this item.barrier there would be a problem, because this work-item is writing to local_idx but reading from workgroup_size minus local_idx minus one: it's reading from a completely different slot, one that's been written to by another work-item. We need to be able to guarantee that the other work-item has finished writing to that slot in local memory, and the only way we can do that is by using a barrier.
So this just synchronizes all of the work-items in a work-group, but only within that work-group, not across the larger device. We have no way of doing that across the whole device, except by finishing one kernel and then launching a new kernel, and there is some overhead in that.
There is also some overhead if work-items have diverged slightly, and so on. On CUDA hardware, work-groups are organized in warps, which are groups of 32, and warps might be executing at slightly different speeds. So it might just happen that one warp is slightly ahead, running ahead by a few cycles or whatever, and then it needs to wait for another one to catch up. But the performance gain of using shared memory is really worth this barrier.
It's worth having to wait for, say, that one warp. And it also applies within warps: because of independent forward progress, sometimes within work-groups there can be divergent control flow, so you do need to call the barrier within warps as well. But yeah, the overhead is worth it, essentially, because you're using very, very fast memory.
So, by local memory, if you mean CUDA shared memory, which requires this synchronization, then you can't share that among work-groups on the same device, let alone across different devices.
A
So
not
necessarily
from
writing
to
the
same
memory.
It's
just
you
have
two
things
happening,
asynchronously
so
essentially
or
not
asynchronously,
but
concurrently,
so
one
is
going
to
write
and
then
essentially
you
want
this
one
to
read
as
soon
as
it
has
been
written.
This
is
a
way
of
enforcing
that,
but
there's
this
event
that
every
work
item
needs
to
get
to
before
it's
allowed
to
read
it's
not
it's
not
necessarily
it's
not
about
processes
and
we're
not
thinking
about
the
operating
system.
A
Really,
here,
that's
more
got
to
do
with
yeah
we're,
not
thinking
of
of
operating
systems.
Here
we're
just
thinking
about
device
code
essentially.
Well,
I
I'm
not
sure
how
operating
systems
interact
with
you
know
offload
devices
in
the
first
place,
but
no
all
of
this
is
allocated
within
the
program
anyway
and
within
an
individual
program.