From YouTube: 2. Enqueuing a SYCL Kernel
Description
Part of An Introduction to Programming with SYCL on Perlmutter and Beyond, March 1, 2022. Slides and more details are at https://www.nersc.gov/users/training/events/an-introduction-to-programming-with-sycl-on-perlmutter-and-beyond-march2022/
A
Enqueuing a SYCL kernel. So, do we still have everyone? Rod, do you think we're okay to go?
B
Yeah, I think so. Yeah, crack on, yeah.
C
Can I just ask a very basic question? This is Lucian. So what is the role of the SYCL targets, since you specify that at compile time? I suppose it builds the kernel for the CUDA backend, but then with the filter you can actually switch what you want. So if I don't specify that one, it means that if I change the SYCL device filter to be GPU, I'm not going to have the kernel built, so I suppose you can't run it.
A
Okay, so using SYCL targets you can specify multiple targets. So if I wanted to, I could compile for every conceivable backend, and I'd be producing device code, or device IR, for every conceivable backend, and then you could choose at runtime. So the compilation and the device selection at runtime are very separate. It's the user's responsibility to make sure that you have the correct device code before a certain device is chosen at runtime.
C
So now, when I got the errors, I was able to run on the host, but I got errors on the device. That means it was an error with whatever GPU it was using?
A
Exactly, exactly, yes. And you can actually see, you say this is Lucian, is it? Yes, yes, sorry. You can see that your error is an error that's being passed by the plugin interface, the PI code, PI CUDA, so this is relating to the backend. I'm not entirely sure if this would change if you're using an sbatch script; maybe it's something to do with the permissions of your account, potentially. I'm not sure.
A
No, yeah, it could be. Or maybe, if you run nvidia-smi or something, you might see that there's a really serious job happening on that node, but I'm not entirely sure.
A
Brilliant, yeah, thank you. Okay, so I'm going to crack on. Okay, so first kernels. Again, SYCL is C++, but with, you know, offloading. Okay, so learning objectives: learn about queues and how to submit work to them. Okay, so someone very, very astutely asked the question: how does this map to a CUDA stream? At the moment, one to one; in the future it's hopefully not going to be one to one, because that will allow for more concurrency. And: learn how to allocate, transfer and free memory using USM.
A
So, the queue. In SYCL, all work is submitted via commands to a queue. The queue has an associated device that any commands enqueued to it will target, okay. So when you construct a queue, it essentially gets some device by some manner: it might be a default device, which you can specify with some device filter, or you could explicitly ask for a GPU.
A
You can also write your own device selectors, which is outside the scope of this, but you could feasibly, you know, choose a device that's only a CUDA device, or one that has a certain string in its name, or something like that. You can define your own ways to select devices. So there are several different ways to construct a queue; as we say, we're going to default-construct one, just because it gives you a lot of flexibility at runtime. This will have the SYCL runtime choose a device for you, and you can, you know, override this using your device filter, as we saw.

A
So, work submitted to a given queue can be executed in any given order. This is important in general when we're dealing with SYCL: the queues are not necessarily in-order, and work can be executed in whatever way the runtime, the scheduler, thinks is optimal. As we mentioned earlier, at the moment, because we're dealing with a queue mapping to a CUDA stream, this doesn't really apply: CUDA streams are in-order, so at the moment queues with the CUDA backend are in-order, but this is liable to change. So it's good to pretend they're out of order, and to explicitly specify that one thing follows another. So it is necessary to define a given task's dependencies: the events that we want to wait on before the next task happens. So we can call wait, saying: okay, don't do anything on this until this has happened; or, when adding a task to a queue, we can say: don't do that until this event has finished, that event has finished.
A
Okay, and SYCL events are returned from tasks, okay. So when you enqueue a task, you get an event, and then you can wait on that: on the event, or on the actual queue itself. So, constructing queues.
A
So here's a default queue; we've actually already gone over this very quickly. And then here's a queue with a GPU selector. So then, as we saw, this will throw a runtime error if no GPUs are available.
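A minimal sketch of the two constructions being described, assuming SYCL 2020 naming (older DPC++ releases spell the selector sycl::gpu_selector{} and use the CL/sycl.hpp header):

    #include <sycl/sycl.hpp>

    int main() {
      // Default-constructed queue: the SYCL runtime picks a device,
      // which can be influenced at run time, e.g. via SYCL_DEVICE_FILTER.
      sycl::queue q;

      // Queue with a GPU selector: construction throws a sycl::exception
      // at run time if no GPU is available.
      try {
        sycl::queue q_gpu{sycl::gpu_selector_v};
      } catch (const sycl::exception &e) {
        // reached when no GPU device can be found
      }
    }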
A
So in SYCL there are two models for managing data: the buffer-accessor model, which we're just going to mention, because we don't actually have time to get into it today, and unified shared memory. So unified shared memory involves explicit allocations and frees, and sometimes memcopies, but not always. The model that you choose can have an effect on how you enqueue kernel functions; we'll see that in a moment.
A
So for now we're going to focus on the USM model. You need to know that the buffer-accessor model is a thing in SYCL, but we're just not going to cover it today; maybe in a future workshop.
A
Okay, so here's a little table; this is from the DPC++ book, which is very good, recommended.
A
So when you malloc on device, the pointer that's returned is not accessible on the host. If you try to dereference it, or access the data within that allocation, on the host, you'll get a segfault or an illegal-access error. It is accessible on device.
A
Okay. If you do a host malloc, it's accessible on the host and also on the device, and it's located on the host. It's not necessarily recommended to use a host malloc for device work; if you wanted to share an allocation between device and host, it's better to use malloc_shared. This is accessible on host and device, and it will use the underlying CUDA API, say cudaMallocManaged, which will allow you to essentially use the same pointers on device and on host, and then it'll migrate the data in between.
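A sketch of the three USM allocation flavours just described, using the SYCL 2020 spellings (counts are in elements for these templated forms):

    sycl::queue q;

    // Device allocation: only dereferenceable inside kernels running on
    // the device associated with q.
    int *dev = sycl::malloc_device<int>(1024, q);

    // Host allocation: accessible on host and device, resides on the host.
    int *host = sycl::malloc_host<int>(1024, q);

    // Shared allocation: accessible on both; on the CUDA backend this maps
    // to cudaMallocManaged, and pages migrate between host and device.
    int *shared = sycl::malloc_shared<int>(1024, q);

    sycl::free(dev, q);
    sycl::free(host, q);
    sycl::free(shared, q);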
A
So, malloc_device. You have two versions: you have a C version, which returns a void*, or you have a templated C++ version; so, depending on your poison, yeah, they do the same thing. You can just pass in a template parameter, maybe a little bit neater if you prefer C++. So again, this is only accessible on the device: any pointer that's returned from this is only accessible on the device.
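For concreteness, a sketch of the two forms (the C-style one takes a size in bytes and returns void*; the templated one takes an element count and returns T*), assuming a queue q and a count n:

    // C-style version.
    void *raw = sycl::malloc_device(n * sizeof(float), q);

    // Templated C++ version; a little neater.
    float *ptr = sycl::malloc_device<float>(n, q);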
A
So both these functions allocate the specified region of memory on the device associated with the specified queue; so you need to pass in a SYCL queue, okay, and it needs to be associated with the device. A queue is implicitly associated with a device and a context, but, most importantly here, a device, and as a result the queue needs to be part of the malloc_device call, because you're specifying which device you want the allocation to happen on.
A
So it's only accessible in a kernel function running on that device; very important. So, kernel code: again, this is the device code, essentially. The only bit in our SYCL code that is going to be run on device is the kernel function, so that's the only place where we can access this.
A
So we get a synchronous exception if the device does not support USM device allocations; we don't need to worry about that today. And it's a blocking operation. That's sort of important: a lot of operations in SYCL are not blocking, but this is blocking. Okay, malloc_shared, yeah. This is convenient; it uses cudaMallocManaged, and then the pointer is accessible from host and device.

A
It is not run asynchronously; it is blocking. It would be equivalent to enqueuing it to the queue and waiting on it immediately, but we haven't gone through waiting yet. All of these malloc_x functions are blocking. So this is convenient: we can make a single allocation and then access the pointer from host and device, and the API will migrate the data back and forth. It's not as performant as doing explicit memcopies, because of the underlying mechanism used to transmit the data: it relies on page faults, essentially. It relies on one device asking for the data, and then the API realising, oh, it's not there yet, now we need to go and get it. Whereas if you tell the API to send things explicitly, then usually, if you're doing a lot of memcopies, it'll be more performant; and yeah, cudaMallocManaged, potentially slower. Okay, free. So this actually should be sycl::free, in the sycl namespace: sycl::free.
A
So, in order to prevent memory leaks, USM device allocations must be freed by calling the free function; this should be sycl::free, in the sycl namespace. If you just use a normal free, which is part of your normal C library, then you might get an error, and I think in fact you will get one in DPC++. Okay. And this is also blocking, and the queue needs to be the same one the allocation was made with; yeah, that maybe goes without saying.
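A minimal sketch of the allocate/free pairing being described; note that the same queue (strictly, the same underlying context) must be passed to sycl::free:

    sycl::queue q;
    int *dev = sycl::malloc_device<int>(1, q);
    // ... use dev in kernels submitted to q ...
    sycl::free(dev, q);   // blocking; plain ::free(dev) would be an error here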
A
Okay, memcopies. So this is important if you're using, say, malloc_device. So let's just say that you allocate some space on a device, and then you also have a vector; you want to copy the elements from the vector over to the device. You need to explicitly copy the memory over; yeah, this is probably straightforward. And the same function is used regardless of which direction you're going in: the destination might be on the host and the source on the device, or vice versa.
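A sketch of the two directions; the same member function is used both ways, with the destination first, then the source, then a size in bytes:

    std::vector<float> host_data(n, 1.0f);
    float *dev = sycl::malloc_device<float>(n, q);

    // Host to device.
    q.memcpy(dev, host_data.data(), n * sizeof(float)).wait();

    // Device to host.
    q.memcpy(host_data.data(), dev, n * sizeof(float)).wait();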
A
Yeah, copying between devices, between queues, is not necessarily allowed unless they share the same context, but that's something that we're not going to cover today. At the moment we just want to think about host to device and device to host. Another important thing, actually, that I didn't mention on the previous ones:
A
I don't know, sorry, that's my... sorry. Okay, copy. So we have this standard vector of dependent events, so we can actually pass in a vector of events that we're waiting on, so that this will not happen until we get the go-ahead from the previous events. Okay, and this returns an event as well, so actually we could take this event and then submit it to the dependent events of the next kernel, and so on.
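A sketch of that chaining, using the overloads that take a std::vector of dependent events (the names here are illustrative):

    sycl::event e1 = q.memcpy(dev, in.data(), n * sizeof(float));

    // This copy will not start until e1 has completed, and it returns an
    // event that could in turn be passed on to the next command.
    sycl::event e2 = q.memcpy(out.data(), dev, n * sizeof(float),
                              std::vector<sycl::event>{e1});
    e2.wait();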
A
This is a neat way of doing things, which we'll see in the exercise that's coming up. Okay, so pretty much all of these queue member functions return an event; I think all of them do. So it's a good idea either to wait on them or to pass them on to subsequent commands as dependent events. Okay, we'll see how this happens in the next exercise.
A
So, memset: this is just setting the bytes in a particular allocation, setting the value for num_bytes. And then fill as well: initialise the data with a recurring pattern.
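A sketch of both; memset works byte-wise, like C's memset, while fill repeats a typed value:

    int *dev = sycl::malloc_device<int>(n, q);

    // Set every byte of the allocation to zero.
    q.memset(dev, 0, n * sizeof(int)).wait();

    // Fill with a recurring typed pattern: every element becomes 42.
    q.fill(dev, 42, n).wait();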
A
We'll see some examples of this in the next few slides. You can also submit a kernel as a parallel_for, and this will be executed over a certain range. We're just going to deal with a simple range for now: this is a one-, two- or three-dimensional object which says, let's have five in the x direction, ten in the y direction and a hundred in the z direction.
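For example, a sketch of a parallel_for over a simple one-dimensional global range; the three-dimensional case from the spoken example would use sycl::range<3>{5, 10, 100}:

    float *dev = sycl::malloc_device<float>(n, q);

    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
      dev[i] *= 2.0f;   // one work-item per element
    }).wait();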
A
Okay, so kernels take the form of function objects or lambdas. Lambdas, as we say, are used a lot in SYCL; quite convenient, if you ask me. The queue provides member functions which allow you to invoke a single task or a parallel_for. Okay.
A
Later on in the workshop we'll see that there are other ways of enqueuing a parallel_for or a single task, but these are maybe the most straightforward, shortcut ways. And these can only be used when using the USM data management model; yeah, that's correct. We don't necessarily need to worry about that at the moment.
A
So, here's a basic SYCL application which uses shared USM and invokes a kernel function with a single_task. So, shared USM. Okay, this is blocking, obviously. It's blocking, and another reason why it's blocking is that it needs to return something that is not an event: there's no way that malloc_shared can return an event, therefore it needs to be blocking. I think that's the rationale. So we're allocating space for one element of type T, associating it with this particular queue, and then we're initialising it on the host.
A
Then this is our kernel code: dereferencing the exact same pointer in the kernel code, and then we're just going to square it, and that's fine, and then return it, okay. So this is totally fine. We need to wait on the event that's returned from the single_task, but then we can just return the value; it'll automatically get the value back from the device.
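A reconstruction of the pattern being walked through, as a self-contained sketch (the function shape, allocate shared, initialise on host, square in a single_task, read back, is assumed from the description):

    template <typename T>
    T square_shared(sycl::queue &q, T x) {
      // Shared allocation: the same pointer is valid on host and device.
      T *data = sycl::malloc_shared<T>(1, q);
      *data = x;                    // initialise on the host

      q.single_task([=] {           // kernel code, runs on the device
        *data = *data * *data;
      }).wait();                    // wait on the returned event

      T result = *data;             // read back on the host
      sycl::free(data, q);
      return result;
    }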
A
Okay, so we allocate USM device memory by calling malloc_device. This is a little bit more involved: instead of just calling malloc_shared and letting the API do all the work with pointers for you, we're going to explicitly malloc on device, which means that we need some memcopies as well. Okay: any time we want to do anything on the device, we're memcopying to the device pointer whatever's in x, just sizeof(T); we're going to square it, and then you need to memcpy it back. Okay. Actually, I'm not sure if anyone is astute enough to notice something that might not necessarily go right with this.
A
Exactly, exactly, yes, well done! Yes, yes, well done, yeah, brilliant. So essentially a queue can be out of order, okay; there's no saying that this will happen after that, and that will happen after that. We need to actually define the dependencies. Okay, so we'll go on to that next. Yeah, well done!
A
If we call wait, then it'll happen kind of in order; you know, we will wait until each has finished. Okay. It's a little bit more elegant if we can use explicit dependencies from events, because it means that we don't have to have a linear, completely one-dimensional DAG.
A
You can have a complex DAG, okay. Let's see something. Okay: so with just wait, you're forced to wait one after the other, whereas if you explicitly name your dependencies using events, and some vector of dependencies, then you can essentially have an arbitrarily complex DAG. Okay, and this will, you know, make a lot of difference in terms of writing performant code; concurrency is obviously very important, so yeah, we need to do this. So then this would depend on event one.

A
This would also depend on event one; they would have no dependency on each other, so they can happen synchronously or not... sorry, they can happen concurrently; and then this depends on both of them.
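A sketch of that diamond-shaped DAG; the pointer names are illustrative:

    // e1: initialise the input.
    sycl::event e1 = q.memcpy(dev_in, host_in, n * sizeof(int));

    // e2 and e3 both depend only on e1, so they may run concurrently.
    sycl::event e2 = q.parallel_for(sycl::range<1>{n}, e1,
        [=](sycl::id<1> i) { dev_a[i] = dev_in[i] * 2; });
    sycl::event e3 = q.parallel_for(sycl::range<1>{n}, e1,
        [=](sycl::id<1> i) { dev_b[i] = dev_in[i] + 1; });

    // The final task depends on both e2 and e3.
    q.parallel_for(sycl::range<1>{n}, std::vector<sycl::event>{e2, e3},
        [=](sycl::id<1> i) { dev_out[i] = dev_a[i] + dev_b[i]; }).wait();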
A
Yeah, kernel function rules. They must be defined using a C++ lambda or a function object; they cannot be a function pointer or a std::function. So this is, as I said earlier:
A
you need to use a C++ lambda or a function object. I would personally recommend lambdas, but it's a matter of taste. They must always capture or store members by value; this is very important. So, when you're defining your single task, you need to capture by value, okay. You can't pass things by reference into a kernel, because, well, certainly with, say, malloc_shared or whatever, you might be dereferencing things in the wrong way. You want to pass them by value, and that will adjust them when they're submitted so that they're appropriate to be run on the device.
A
Yes, so you can name your lambda if you want; you don't actually have to. This is a DPC++ extension: you used to need to name your lambda, which we'll go through later; it can be really useful when you're profiling, but you don't have to any more. So, SYCL kernel function names:
A
they need to be unique as well, but we don't need to think about naming them at the moment. So, SYCL kernel function restrictions: no dynamic allocation, no dynamic polymorphism, no function pointers, no recursion, okay; these are sort of set in stone. Kernels as function objects, okay. So this is with a lambda, okay, just some lambda which is being passed, we see, by value; but we can also use a function object, okay, just the same. It's okay.
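A sketch of the same kernel written both ways; the functor's members play the role of the lambda's by-value captures:

    // As a lambda: dev is captured by value (the pointer itself is copied).
    q.single_task([=] { *dev += 1; }).wait();

    // As a function object: state is stored by value in members.
    struct Increment {
      int *dev;
      void operator()() const { *dev += 1; }
    };
    q.single_task(Increment{dev}).wait();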
A
Can a SYCL kernel in the form of a function object return something, or does it have to be void, like CUDA kernels? That's a good question; so, yeah, Gordon?
F
That's right, yeah. I'm not entirely sure if implementations enforce it, but it's expected that the kernel functions are void. If you want to return anything from a kernel, it has to be done through an accessor or a USM pointer; there's no return type, yeah.
A
You can definitely try it out. There's no reason why you'd necessarily want to: it's not as if a kernel, or a function object that you pass into a kernel, is something you'd want to use for another purpose as well, one that needs it to return something other than void. But yeah, you can try it out; in general you have void.
A
Yes, yes, yes, exactly! So this is: we're using single_task at the moment, but the other one, for parallelism, is parallel_for. So, for this particular exercise,
A
we're just going to be dealing with single tasks, just so we can get our heads around submitting things, getting events and so on. But a parallel_for can be defined on a range, and this is just a simple global range; we'll see later on that there are more complex ways to construct a range, which, you know, is similar to the idea of, say, threads and blocks and grids in CUDA.
A
But this is just a naive global range saying how many work-items we want; we're not worrying about, you know, work-group size, block size, we're not worrying about that in this particular instance, but we'll get on to that later.
G
Yeah, arrays: as arguments, or as captures?
A
One more time? Arrays, oh, can we pass them as arguments? Yes, certainly, yeah. So it's easy to pass them by captures, but yeah, you can pass arrays as arguments as well. Yeah.
A
These kernel submissions are not going to be altering, say, the underlying pointer; they're just going to be acting on the data at that point, so you don't need to pass things... well, actually, sorry, I should say: things need to be passed by value.
A
So if you're passing, say, a pointer as an argument, you need to make sure that you're not passing it by reference, because we don't want the possibility that it might be altered by the kernel. That might throw an error; I'm not sure what the compiler would say if you tried to do that, but yeah, always by value in general, and it's easy to do that with just a value capture. Yeah, capture is very easy.
A
A vector, a std::vector? Okay, so you don't necessarily want to pass a std::vector into your kernel: the std::vector lives in host memory, so there's no way of accessing that from the device, from a kernel. You know, allocations that are on the host: a std::vector is a host allocation, even though it doesn't necessarily call itself that. You need to make sure that anything that's being used on the device has a device allocation or a shared allocation. Using the buffer-accessor model there's quite a neat way in which vectors create buffers, which are then accessed with accessors, but we're not covering that today.
F
Yeah, I think generally the problem with using std::vector is that it can dynamically reallocate the memory, which, obviously, done in a kernel, is going to be problematic. std::array can be used; it's used in quite a few places internally. But generally you wouldn't want to use std::vector.
A
We might start looking at the next exercise.
B
What time do you want to restart? Personally, I think we were scheduled to restart a little while ago, but we can shift things forward a bit.
E
Yeah, up to you, really.
B
I guess, let's give people, say, 25 minutes to do that exercise and then have a bit of a break, and then that takes us to, brilliant, ten past... ten minutes past 11, Pacific.
A
Okay, okay, so let's have a look at this. Okay, so this is our latest task. Okay, so essentially we want to... let's look at the README as well.
A
Okay, so, instructions: we want to allocate two ints on device, where a is one and b is two, okay. So we need to memcpy to initialise the device memory for a_dev and b_dev. So yeah: now we want to use a single_task to multiply a_dev by two, use a separate single_task to multiply b_dev by 100, then use another single_task to add the results of both together and store the value in a. We then copy the value back to host and print it to standard out. Okay.
A
That would be nice; not essential, but this is a nice use of a very, very simple DAG. So then the dependency will be on this for both of these two, and then this has a dependency on both of these, and then the memcpy will have a dependency on that.
A
Allocate device memory, so we can use malloc_device, or maybe malloc_shared if you like; memcpy; and then free memory, single_task, and so on.
A
No, no, sorry: the property list, the default property list, is empty, yeah, so you don't need to worry about specifying a property list. These are kind of there just in case, at some stage, it becomes a good idea to implement property lists for these things, but I actually don't think that there are any defined properties that you could pass into malloc_device. So, correct me if I'm wrong, Gordon.
F
Yeah, I don't think there are any properties available you can use at the moment. Generally, most SYCL classes can be constructed with a property list, but in a lot of places it's there, sort of, for the future.
G
It has to be int*, even though it's just a scalar, just a single integer, because it's a pointer, a device pointer.
H
I have kind of a general question, if that's all right, about SYCL. It seems like it really relies on building a graph with the right dependencies for each kernel in the queue, and that seems like an easy thing to mess up: just forget to add one dependency and you're left with some very difficult-to-debug race condition much later on down the line. Are there any strategies to help do this correctly?
A
Definitely: if you're just trying to get code working, just use waits. If you wait, then this will enforce this linear, one-dimensional chain of execution; that's an easy thing to do. Whereas if you try to do the more complex things, maybe this is a little bit more subtle, more nuanced, but it can give better performance, theoretically at least. Well, yeah, theoretically; but in general, if you're trying to remove elements of asynchrony, just call wait, because that will, you know, just wait on whatever it is, so that in effect things will be happening sort of synchronously, you might say.
H
Yeah, that makes sense, thanks. And I guess, relatedly: do you find that, you know, medium-complexity scientific SYCL applications do end up with very complicated branching graphs, or do you find that more of the computation is embedded in the kernel, such that you do have a pretty simple flow?
A
Well, personally I've been working on some deep neural network libraries recently, which is what I'd point to, and yeah, definitely there is an element of concurrency there which, you know, would involve this kind of branching DAG. But it's not necessarily the case; it's really implementation-dependent, or really application-dependent. People at the labs can maybe answer this question better than I can.
F
From my experience, I think generally where you see DAGs like this is when you're doing sort of copying data whilst doing compute at the same time, like double buffering, things like that; or, you know, using multiple devices and doing load balancing. That's where more complicated DAGs tend to come up, or if you're sort of doing interop between SYCL kernels and something else.
I
Maybe to go back to the dependency chain: I know you will not talk about buffers and stuff, but one of the big advantages of buffers is handling all this data dependency for you automatically, right? And I think it is one of the nice advantages: in theory, the runtime can be smart enough to do all this kind of interleaving and just put the correct dependencies in automatically for you. I think it's a really good thing to use, but porting your code to use buffers is more involved, indeed, yeah.
A
Absolutely, yes; I should have mentioned that. So the other memory paradigm, the other memory model, the buffer-accessor model: it pretty much does all this stuff behind the scenes for you, so you don't need to worry about it at all. But this kind of explicit dependency naming is maybe more akin to other approaches, yeah.
G
I think you mentioned SYCL wait, so wait is in the sycl namespace; but why is slide 24 saying it's a method as well, of a single task?
A
So this queue submission returns an event, and then you can call wait on an event, which, yeah, means that nothing else will happen until this has returned.
A
This, this q.single_task... oh yeah, nice, yeah. Okay, so actually this is something that's maybe related to what you're saying, but let's just say that we wait and we want to assign that value to something: wait actually doesn't return an event, okay.
D
So that's a, that's a nice... that's a nice point.
I
Yes; maybe the difference between the two is the granularity, right? Where queue wait waits for all the commands that you enqueued into the queue to finish, whereas if you wait on an event, you wait only for this event to finish, right? So there is a little difference between the two. So if you are in an in-order queue, like CUDA, both are totally equivalent, right; but if you are more in the out-of-order way, they are totally different. It's not the same granularity.
C
So what happens if you pick up the event with a variable and then you call wait on it? Where is the wait actually acting? It's on the CPU, on the host?
A
Essentially, I'm assuming that it's an interaction with the plugin interface, which interfaces with, say, the CUDA driver in this instance, and it's saying: okay, on the host, let's wait until we get the plugin interface saying the kernel completed successfully. So yeah, we can wait for an event, which would be, you know, an underlying CUDA event, or we can wait for the queue; and if we're waiting for the queue, that's waiting for essentially everything in the queue to complete.
C
If you want to execute multiple copies, for example, and the order doesn't matter, you could potentially just launch a bunch of memcopies, and it doesn't matter; you don't need to wait for them individually.
A
Exactly; and in fact you don't need to worry about the individual tasks, right, because the queue has a record, it kind of has a hidden record, of all the tasks, so you can just wait until all the events have completed.
C
And if you want to pin the memory on the device, how do you achieve that? Is it when you actually create the queue, I guess, or the memory, when you actually create the memory with malloc_device, I guess? When you say memory: there are multiple ways in which you can allocate memory on the device, right?
F
I can probably answer this. So at the moment the SYCL standard doesn't have an explicit way to do pinned memory, and because of that it can kind of vary from one backend, one particular device, to another. But generally, if you allow the SYCL runtime to allocate the memory for you, through, like, malloc_device, and as long as, sort of, the size of memory you're allocating is sort of along the lines of what the platform would recommend, in terms of, you know, the size of the memory, the cache line, like multiples of cache lines and that kind of thing, then it should allocate it in pinned memory for you. So it's kind of a quality-of-implementation detail. I think there has been some interest in a way, kind of properties, to be able to explicitly request that allocations are pinned, and that's something that we may see in the future. But at the moment there's not an explicit way to do it; it's more implementation-defined.
I
And maybe one general comment: because you produce just CUDA code at the end, you can just use nvprof or whatever tracer you like, and you can verify how they map, right? So this is also the good thing with all these offloading models, or something like that: you can always check what the backend is doing. So at the end, nothing is magic, and you can check if indeed they pin memory, for example.
C
What happens if you run this one, let's say, on a KNL? So you might have a CPU, you might run on KNL, so then you also have different types of memory, right? How do you control which... or even on the GPU, right, like if you are using texture memory?
A
So, using buffers and accessors: you don't have the same control, when you're just dealing with malloc_device and that kind of thing, as to where your memory actually is, what kind of memory you're using. For that, the buffer-accessor model is better. We're going to be looking at using CUDA shared memory in, I think, if not the next section then the one after that, so you'll kind of see how this is done; but it's using, yeah, buffers and accessors... well, not necessarily buffers, but accessors.
C
And if you are going to run this code on the host after that, essentially it's going to skip, I guess, this step of copying the memory? Or what does it do? Does it do a local copy to the memory, or...?
F
So I believe with USM, because it's explicit, it will still perform the copy, even if it's strictly unnecessary; and with the buffer-accessor model it's a bit more forgiving. With buffers and accessors, rather than sort of explicitly, kind of prescriptively, saying what you want to be allocated and copied, when and where, you're kind of describing the requirements in terms of what memory you want where and when, and then the runtime kind of does the efficient thing for you.
A
I think we might start the next section; maybe I'll go through the example very quickly.
A
Okay, so, yeah: essentially, here we have a and b; construct a queue; allocate memory on the device, okay, just size one; memcpy to both, okay; we're getting the return values, the events, from both, okay. Not necessarily... essentially, we could also just do something like, you know, a general q.wait() here, but yeah, we've done this. Okay: this has a dependency on e1; this single_task has a dependency on e2, okay. We're also getting the events that are returned from these, and then we have another single_task which has both of the individual single_tasks' events as its dependencies, and then we're just going to add the two together.
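A reconstruction of that walkthrough as a sketch; the names a_dev, b_dev, e1 and so on are assumed from the spoken description:

    #include <sycl/sycl.hpp>
    #include <iostream>

    int main() {
      sycl::queue q;
      int a = 1, b = 2;

      int *a_dev = sycl::malloc_device<int>(1, q);
      int *b_dev = sycl::malloc_device<int>(1, q);

      sycl::event e1 = q.memcpy(a_dev, &a, sizeof(int));
      sycl::event e2 = q.memcpy(b_dev, &b, sizeof(int));

      // Independent single_tasks: each depends only on its own copy-in.
      sycl::event e3 = q.single_task(e1, [=] { *a_dev *= 2; });
      sycl::event e4 = q.single_task(e2, [=] { *b_dev *= 100; });

      // The sum depends on both; the copy-back depends on the sum.
      sycl::event e5 = q.single_task(std::vector<sycl::event>{e3, e4},
                                     [=] { *a_dev += *b_dev; });
      q.memcpy(&a, a_dev, sizeof(int), e5).wait();

      std::cout << a << std::endl;   // 1*2 + 2*100 = 202

      sycl::free(a_dev, q);
      sycl::free(b_dev, q);
      return 0;
    }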