From YouTube: 3. Data Parallelism
Description
Part of An Introduction to Programming with SYCL on Perlmutter and Beyond, March 1, 2022. Slides and more details are at https://www.nersc.gov/users/training/events/an-introduction-to-programming-with-sycl-on-perlmutter-and-beyond-march2022/
A
Okay, so data parallelism. This is what it's all about, really: obviously we want to use offloading devices because they allow us to do work in parallel, so this is important.
A
So in this section we're going to learn about task parallelism and data parallelism, learn about the SPMD model for describing data parallelism (this is the single program, multiple data model, the Flynn's taxonomy kind of thing), learn about the SYCL execution and memory models, and learn about enqueuing kernel functions with parallel_for. Okay.
So task parallelism is where you have several, possibly distinct, tasks executing in parallel; in task parallelism you optimize for latency, we want low latency.
A
Data parallelism is where you have the same task being performed on multiple elements of data; in data parallelism you optimize for throughput. We're mostly dealing with data parallelism here. Many processors are vector processors, which means that they can naturally perform data parallelism: GPUs are designed to be parallel, and CPUs have SIMD instructions, which perform the same instruction on a number of elements of data.
A
So you might be doing some sort of a loop, and then in parallel SPMD code we're just defining this for a single iteration in that iteration space, and you're doing a parallel_for over it. Okay.
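As a rough sketch of that idea (the queue q and the names a, b, c, n are illustrative, not from the slides), the serial loop and its SPMD equivalent might look like this:

    // Serial loop: the loop body runs once per iteration, in order.
    for (std::size_t i = 0; i < n; ++i) {
      c[i] = a[i] + b[i];
    }

    // SPMD version: describe only a single iteration; the runtime launches
    // one work-item per point in the iteration space.
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
      c[i] = a[i] + b[i];
    }).wait();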
So in SYCL, kernel functions are executed by individual work-items; these are the smallest unit of work.
A
The size of the work-groups is generally related to what works best on the device being targeted. You don't need to specify a work-group size; you can specify one manually if you want, but if not, there are heuristics to choose a good work-group size.
A
It can also be affected by the resources used by each work-item. Okay, so SYCL kernel functions are invoked within an ND-range.
So essentially, work-items are grouped into a work-group, and then the next step up in the hierarchy is the ND-range. In CUDA this would correspond to threads, blocks and then the grid.
A
So an ND-range is composed of the dimensions of the work-group as well as the global dimensions: the global size, the global range.
Okay, so this is an instance of an ND-range. This is the global range, 12 by 12. We're not saying how many work-groups we want; we're saying how many work-items in total we want in the global space.
A
So we have one, two, three, and so on up to twelve, and then the same in the other direction. So the number of work-groups is inferred; it's not handed to the constructor for an nd_range. It could be one, two or three dimensions, and that's important: each of these ranges could be one, two or three dimensions.
A
It has two components: the global range (this bit) and the local range, which is essentially the size of the work-group.
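A minimal sketch of constructing the 12-by-12 example above; the 4-by-4 local range here is only an assumption for illustration, the slide's actual work-group size may differ:

    sycl::range<2> global{12, 12};  // total work-items in each dimension
    sycl::range<2> local{4, 4};     // work-items per work-group
    sycl::nd_range<2> ndr{global, local};
    // The number of work-groups (here 3 by 3) is inferred as global / local.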
A
So multiple work-items will generally execute concurrently on offload devices. It's useful to imagine that these are all executing in complete lockstep; there are exceptions to this, but it's a good thing to have in your head when we're writing code for offloading devices.
A
The order that work-items and work-groups are executed in is implementation-defined. This is important: you could have a large ND-range where it's nice to think of the work-items as all executing in parallel, but in fact you might have these work-groups executing, and then those work-groups, and then more work-groups, and so on.
A
So you need to be careful about any writes to, say, global memory: we can't make assumptions that things are happening at exactly the same time.
A
Okay, so work-items in a work-group can be synchronized using a work-group barrier. This is really important. If you have some work-items in a work-group executing, they might fall out of lockstep, and, especially with larger work-groups, they might not all be executing concurrently at the same time, so imposing a barrier means that all of the work-items need to arrive at the same point before they can get to the next step.
A
This is done with the barrier function, which is a member function of the item (or of a work-group).
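A minimal sketch of the pattern, assuming an nd_range launch (the sizes and the kernel body are illustrative only):

    q.parallel_for(sycl::nd_range<1>{sycl::range<1>{1024}, sycl::range<1>{64}},
                   [=](sycl::nd_item<1> it) {
      // phase 1: each work-item writes something other items will read...
      it.barrier();  // every work-item in the work-group must reach this point
      // phase 2: safe to consume what other work-items in this group produced...
    });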
SYCL does not support synchronizing across all work-items in the ND-range.
Okay, this is also important, so we can't have a global sync of all work-items in our ND-range. If this is something that we want (and it is something we'll be seeing, actually, in the final exercise), then we're better off
writing two separate kernels, splitting the computation across multiple kernels. That's a way to guarantee some sort of global synchronization: you wait until something is completely finished, then you submit a new kernel.
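Sketched roughly (the queue, sizes and kernel bodies here are illustrative):

    // First kernel: produce intermediate results in global memory.
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { /* ... write ... */ });
    q.wait();  // the whole first kernel has finished: a global synchronization point
    // Second kernel: now every work-item can safely consume those results.
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { /* ... read ... */ });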
A
So if this work-item here writes to a value in local memory, then this work-item can read it, but we need to be very, very careful that this work-item is only reading it after this one has written to it. So if we're writing, and we want it to be read later, we might write, then do a barrier to make sure that everything has happened, and then we might read with the other work-item.
A
We also have global and constant memory. Global memory is what we get when we do a device malloc; this is the standard, well, global memory.
A
The
standard
we
can
ask
for
constant
memory
using
accessors.
We
need
to
ask
for
local
memory
using
accessories.
We'll
do
this
at
the
end
of
the
the
next.
The
next
lesson:
okay
and
the
cuts
of
memory
is
read
only,
but
we're
not
really
going
to
be
dealing
with
that
at
the
moment.
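Local memory is covered properly in the next lesson, but as a rough sketch of what requesting it through an accessor looks like (sizes illustrative, using the SYCL 2020 local_accessor):

    q.submit([&](sycl::handler& cgh) {
      // one chunk of work-group local memory, visible to the whole group
      sycl::local_accessor<float, 1> scratch{sycl::range<1>{64}, cgh};
      cgh.parallel_for(sycl::nd_range<1>{sycl::range<1>{1024}, sycl::range<1>{64}},
                       [=](sycl::nd_item<1> it) {
        scratch[it.get_local_id(0)] = 0.0f;  // each work-item writes its own slot
        it.barrier();                        // then the work-group synchronizes
        // ...
      });
    });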
A
Also, let's just say that this work-item writes to global memory and then this work-item wants to read that exact same thing that's been written into global memory. We can't do that safely; there's no way of doing that in a safe manner.
A
So it's better to split this up into two separate kernels, where in the first kernel this work-item writes that value, and then in the second kernel that value is read by some other work-item.
Okay, so a parallel_for. This is a member function of a command group handler, but we're just using a queue in our examples. You define it on a range; this is just a normal range.
A
It's not an nd_range. An nd_range (think nested) corresponds to defining the global range and also the work-group size; in this case we're just interested in the global range, so this is not an nd_range. Into this lambda you're capturing things by value, and you're defining an index argument, and this index is really useful: it essentially tells you the position of the thread within
this range of threads. Okay, and we can see as well that this is a two-dimensional range and, as a result, our ids are two-dimensional as well. So this is some two-dimensional object; we can get the individual dimensions by using just array access, [0] and [1]. Okay, so this is taking a single id, and this can be used to find its position within the iteration space.
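As a rough sketch of that kind of two-dimensional parallel_for (the names in and out and the sizes are illustrative):

    q.parallel_for(sycl::range<2>{64, 64}, [=](sycl::id<2> idx) {
      std::size_t row = idx[0];  // position in dimension 0
      std::size_t col = idx[1];  // position in dimension 1
      out[row * 64 + col] = in[row * 64 + col] * 2.0f;
    });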
A
Okay, so this is a parallel_for taking a range object, and this one is one-dimensional, obviously, so the id is one-dimensional as well. So this is a sycl::id; this is just going to be, sort of, like a tuple, a one-, two- or three-element tuple, which tells you the position of the thread within the space. If you wanted to, you could also get a sycl::item object, and this has a little bit more to it.
A
I would point you to the SYCL spec to look at all the member functions, which are great. Okay, now, in the final one we're using an nd_range, so a nested range: you have your global range, and the entire global range is going to be 1024 by 1024... sorry, no, it's just a single dimension, so it's just 1024,
and then the work-group size is going to be 32; the local range is going to be 32. And then you pass in an nd_item, which is similar to an item. Sorry, I said earlier that with an item you can use a barrier: you can't use a barrier with a normal sycl::item. With an nd_item you can use a barrier, because this means essentially synchronizing within a particular work-group.
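A minimal sketch of that launch: 1024 work-items in total, in work-groups of 32, with an nd_item argument (the kernel body is illustrative):

    q.parallel_for(sycl::nd_range<1>{sycl::range<1>{1024}, sycl::range<1>{32}},
                   [=](sycl::nd_item<1> it) {
      std::size_t gid = it.get_global_id(0);  // position in the global range
      std::size_t lid = it.get_local_id(0);   // position within the work-group
      std::size_t grp = it.get_group(0);      // which work-group this item is in
      // it.barrier() is available here because we have work-groups
    });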
A
In the plain-range case we don't necessarily have the concept of work-groups, because we're not defining an nd_range, we're just defining a single range. And we can get lots of nice stuff from this nd_item; again, I'd point you to the SYCL spec. Okay, questions?
A
This is outside the scope of the workshop today. Almost all CUDA features are implemented in DPC++, and the ones that aren't are the newest ones; these are the ones that we're currently working on, so we're quite fast to implement things. Definitely consult the spec; we can post this, maybe, in the Slack.
A
That's defined by the implementation. In SYCL code, if you have work-items reaching a barrier via different branches, this is actually undefined behaviour, but with the DPC++ back end it usually agrees with the CUDA behaviour. Gordon, is that correct? Am I...
C
Yeah, so this is something we're working on, trying to expose a bit better in the CUDA back end at the moment. The SYCL spec currently works under the assumption that, well, it doesn't make any guarantees about the execution of work-items within the work-group; they can make progress in any way they like. So for CUDA, you know, on newer architectures,
they can move with independent forward progress, but when it comes to certain operations like group-based functions, so work-group level or sub-group or warp-level functions, these generally, or often, require convergence or synchronization. And obviously in the CUDA execution model there are a lot of features where you can have individual threads, you know, barriers and copies and things like that, happen for individual threads rather than in warps.
A
Okay, this is just a simple vector add, just using parallel_for, so we don't necessarily need to worry about nd_ranges; just the global range will be fine. Okay, we have two vectors and we want to add them on the device, and we'll be checking the results at the end. Okay, so to compute this in parallel on the SYCL device, we need to construct a queue, allocate memory, copy memory to the device, use a parallel_for to add the two arrays, and transfer the memory back to the host.
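A hedged sketch of that sequence of steps using USM (the names and sizes are illustrative, not the actual exercise code):

    #include <sycl/sycl.hpp>
    #include <vector>

    int main() {
      constexpr std::size_t n = 1024;
      std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

      sycl::queue q;                                      // construct a queue
      float* da = sycl::malloc_device<float>(n, q);       // allocate device memory
      float* db = sycl::malloc_device<float>(n, q);
      float* dc = sycl::malloc_device<float>(n, q);

      q.memcpy(da, a.data(), n * sizeof(float)).wait();   // copy inputs to device
      q.memcpy(db, b.data(), n * sizeof(float)).wait();

      q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        dc[i] = da[i] + db[i];                            // add the two arrays
      }).wait();

      q.memcpy(c.data(), dc, n * sizeof(float)).wait();   // copy result back to host
      // ... check the results on the host ...
      sycl::free(da, q);
      sycl::free(db, q);
      sycl::free(dc, q);
    }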
A
We don't need to worry about an nd_range in this case, just a global range. And then it might be worth mentioning as well that a global range would be constructed by saying something like...
A
Any questions, post them in the Slack. Okay, any relation to the warp? So, no, the work-group size is more akin to, say, the block size. It's variable; it doesn't need to be exactly a warp, but usually you want it to be a multiple of a warp, usually say 32, 64, 128, 256,
and so on. Yeah, one important distinction between the CUDA grid and the SYCL nd_range is that (I think I'm correct in saying) with the CUDA grid you're specifying the number of blocks, whereas with the nd_range we don't specify the number of work-groups, we just specify the size.
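As a rough illustration of that distinction (sizes illustrative):

    // CUDA: specify the number of blocks and the block size, e.g.
    //   kernel<<<32, 32>>>(...);              // 32 blocks of 32 threads = 1024
    // SYCL: specify the total (global) size and the work-group (local) size:
    sycl::nd_range<1> ndr{sycl::range<1>{1024},  // total work-items
                          sycl::range<1>{32}};   // work-items per group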
A
It's supposed to be just wait, I think, or calling wait on an event. So with our particular implementation, which maps a single queue to a stream, calling wait on a queue will just wait on that particular stream, which is going to go to a stream synchronize underneath. In the future, when this isn't the case and a queue maps to multiple streams, then you'll need to explicitly list the events and then call wait on those, I think.
A
Or you could build up a dependency graph within your application and then call wait on the last event, or something like that. So you can kind of build these streams theoretically and then call wait on, you know, the last one of those.
A
Is there a way to visualize the graph, a nice tool to visualize the graph? As far as I'm aware, there is not; someone correct me if I'm wrong.
A
The name of the tracer... so we'll be looking at profiling tools in the final section.
A
So it will be there, cool, it'll be there, absolutely. And we'll fly through the next section, but just to echo what Courtney said: we have a PI tracer, so the plugin interface again talks to the back end, the plugin, whatever that might be, so essentially this is showing everything that the plugin interface is doing.
A
The plugin interface is doing a lot of things, so you can see how the SYCL implementation is talking to the back end. These are essentially messages passed to, say, CUDA, and then it gets back a success or whatever.
A
This can be useful if you think you're getting some kind of an error in the way that the plugin interface is interacting with the back end, or if you think that there's an error in the back end somewhere; then you can easily locate it. When I say easily... these things are usually difficult to find.
E
I have a question before we move on. So we used only scalars right now, but is it also possible to use complex objects, structures or classes?
A
These arrays are accessible to the kernel code; you need to make sure that they're either malloc'd on the device, or that you're not just passing pointers to, say, host memory. You need to make sure that these things exist in device memory, yeah, cool. And that's kind of a performance question as well, in terms of whether it's better to organize things as structs of arrays or arrays of structs, and usually the answer is structs of arrays.
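As a rough illustration of the struct-of-arrays versus array-of-structs point (the Particle names are made up for the example):

    struct ParticleAoS { float x, y, z; };       // array of structs: p[i].x, p[i].y, ...
    // neighbouring work-items reading p[i].x touch strided memory
    struct ParticlesSoA { float *x, *y, *z; };   // struct of arrays: x[i], y[i], z[i]
    // neighbouring work-items reading x[i] touch contiguous (coalesced) memory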
A
So we're just trying to get the index; the only index that there is, really, is the global id. Then we're indexing into a and b, adding those and saving the result to our output, then calling wait, then copying back and checking the results.