From YouTube: 1. Introduction to SYCL
Description
Part of An Introduction to Programming with SYCL on Perlmutter and Beyond, March 1, 2022. Slides and more details are at https://www.nersc.gov/users/training/events/an-introduction-to-programming-with-sycl-on-perlmutter-and-beyond-march2022/
A
So welcome, everyone, to An Introduction to Programming with SYCL on Perlmutter and Beyond. We're really excited to offer this training event and to host Codeplay, who are real experts on SYCL, to help us learn more about this programming model. I'm Brandon Cook; I work at NERSC in the Application Performance Group. Amanda Dufek is also on the same team, and Helen is here from our User Engagement Group. And then from Codeplay we have Hugh, who's going to lead the training from their side, and Gordon Brown and Rod Burns are going to be here to assist as well.
A
So, some basic logistics. Make sure that you are muted when you think you are — I think you might be muted by default when you join. So if you are attempting to speak and no one's responding, please check that you're off mute.
A
Also, we ask that you please rename yourself in your Zoom session; instructions are here on the screen. This will allow us to streamline helping you out if any kind of support question comes up during the course of the training event.
A
As you've probably seen from the pop-up, live transcription and full view of the transcript are enabled, and we are recording this. So if you do not wish to be recorded, this is your notification, or reminder, to keep your video and audio off.
A
If you just want to listen, that's fine. Please ask your questions in the general channel in Slack; this is preferred over the Zoom chat. Here's the link to join that channel, and this link should also have been provided to you via your email invite for this event.
A
And, as I mentioned, the slides and videos from this will be made available after the event's over, so if you want to reference this content later, it's going to be available. There will be a hands-on component.
A
You can find it in this repository on GitHub.
A
And finally, I invite you all to please answer the following survey, which is linked here; I'll put the links into the chat momentarily, after I'm done sharing my screen. Answering this survey really helps us improve these events for everyone.
A
Okay, so, using Perlmutter. If you're already a NERSC user, you will have been added to the ntrain1 project. If you have a training account, those expire on March 8th.
A
I'll also call out that, due to the limited number of nodes available, please prefer to use the batch system, as opposed to allocating nodes interactively, within the reservation. And if you're on a training account, you can also access the regular Perlmutter nodes with the same account.
A
And if you go to docs.nersc.gov/systems/perlmutter, this is the landing page to start from for any Perlmutter-specific documentation.
A
I think we're going to cover this again in more detail — the slides have been posted in the chat, and we don't expect you to copy these commands down — but Codeplay has prepared a module for us to use with the compiler, and we'll get into actually using it in the hands-on portion.
A
But if you do have a need to mix in MPI, I'd invite you to please send me a message via Slack, and I'll see if I can figure out a way to accommodate that for your application.
A
Okay, so, finally, the schedule. We're going to start with the introduction, and then discuss how to actually get some work scheduled on the GPUs. We'll have a few short breaks, then some discussion on profiling and debugging, and then an open question-and-answer session, followed by the end of the event. And so, with that, thanks everyone for attending; I'll stop sharing and hand it over to you.
B
Yeah, we're going to be looking at some basic features of the language, hopefully covering as much ground as possible in this short window. We have lots of materials available online, so this will be a jumping-off point into the language, but also into using the language in a serious way. So, without further ado — and if anyone has questions, please interrupt me at any time. You can use Slack, you can—
B
You can use the chat in Zoom, but I think the most reliable way is maybe just to interrupt me. So please feel free to interrupt me, because everyone has questions all the time, and if we ask these questions, then everyone gains from it. So definitely, please ask some questions.
B
Okay, so let's get stuck in. What is SYCL? This introductory chapter will give us an introduction to SYCL, and also to the compiler that we're going to be using on Perlmutter. The learning objectives for this module: learn about the SYCL spec and its implementations, learn about the components of a SYCL implementation, learn about how a SYCL source file is compiled, and learn where to find useful resources.
B
We're going to be glossing over details about implementations in general, and focusing more on DPC++, which is the Intel oneAPI implementation, especially with its CUDA backend. So: SYCL is a single-source, high-level, standard C++ programming model that can target a range of heterogeneous platforms.
B
I wouldn't try to take all of this in at once. Okay, so, a first example of SYCL code — this is just what SYCL looks like in the wild. You can see — I'm not sure if people can see my mouse?

B
Yes? Good, okay. So, essentially, we're constructing a queue. This is associated with some device; it's a kind of unit of work, like a list of things to be done. We'll go through this in more detail later, so maybe don't try too hard to remember it. We malloc some memory, then we do something — sorry, we initialize it.
B
Then we have some kernel code. This parallel_for is kernel code, and it sits within your normal C++ file — that's a core aspect of SYCL, as we'll see. SYCL extends C++ in two key ways. First, device discovery and information: we can find out what devices are available and which devices we can choose.
B
There are heuristics for prioritizing this device over that device. Second, device control: dispatching work to a particular device, and so on. And SYCL is modern C++ — it's essentially built on modern C++ features like templates and lambdas, so if you like templates and lambdas and you use them a lot, you'll like SYCL. SYCL is open source, multi-vendor, multi-architecture.
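The queue / malloc / kernel flow described above can be sketched roughly like this — a minimal unified-shared-memory example, assuming a SYCL 2020 compiler such as DPC++; the kernel body here is illustrative, not the exact code on the slide:

```cpp
#include <sycl/sycl.hpp>  // some older DPC++ releases ship this as <CL/sycl.hpp>
#include <cstdio>

int main() {
  sycl::queue q;  // construct a queue, associated with some device

  const size_t n = 1024;
  // malloc some memory: USM visible to both host and device
  float *data = sycl::malloc_shared<float>(n, q);
  for (size_t i = 0; i < n; ++i) data[i] = 1.0f;  // initialize on the host

  // kernel code lives inside the normal C++ file: a parallel_for over n items
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    data[i] *= 2.0f;
  }).wait();

  std::printf("data[0] = %f\n", data[0]);
  sycl::free(data, q);
}
```

This only compiles with a SYCL-aware toolchain such as the DPC++ module prepared for the workshop; it is meant as a reading aid, not as the exercise code.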
B
SYCL is a single-source, high-level, standard C++ programming model that can target a range of heterogeneous platforms — with emphasis on single source: within the same file we have the host code, the CPU code, and the device code, the code that we want to target at, say, a GPU or some other offload device.
B
So we have parallel host-compiler and device-compiler passes, which are then linked together to make a bundled executable; we'll go through this a few more times. The main idea is that there are two separate compilers: your device compiler and your host compiler. There are many kinds of compilation processes, so we don't need to get too bogged down in that; in a second I'll talk about the CUDA-specific compilation model for DPC++ at a high level.
B
Okay, so SYCL is a high-level language. It's based on modern C++, and it doesn't add any language features that aren't already in the language. It provides high-level abstractions over common boilerplate code, which is a great thing if you're used to dealing with OpenCL or other more low-level APIs.
B
It gives us high-level abstractions over common boilerplate for device selection, platform selection, kernel function compilation, and dependency management and scheduling. This is quite natural in SYCL — the API really allows us to do it quite elegantly, and in my opinion it's one of the great things about using SYCL. And it's standard C++: SYCL doesn't add any language features that aren't already in the language.
B
Features are implemented in the backend in particular ways by an implementation, but they're just normal language features: you use things like lambdas and templates to write kernel code, unlike CUDA or OpenCL, and there are no pragmas like in OpenMP, that kind of thing. And we can target a range of heterogeneous platforms — this is another really great thing about SYCL.
B
We can take the exact same code and run it on as many backends as are supported, essentially — unless we're doing very, very specific things, particular to some hardware, maybe CUDA-specific. In theory we can target CPUs, GPUs, APUs, accelerators, FPGAs, DSPs — loads and loads of things. That's a really good thing about SYCL: the interchangeability of what we're actually offloading to.
B
So, the SYCL spec. The first version of the spec was SYCL 1.2; we're currently on SYCL 2020.
B
The spec has been defined, and the main implementations have almost fully implemented SYCL 2020. None of them have completely finished implementing the spec, but they're pretty close to done. In terms of daily use, you wouldn't know that the entire spec isn't implemented; it's very workable and useful.
B
Here's an overview of some implementations. We're going to be focusing on oneAPI DPC++, particularly with the CUDA backend. Codeplay also has its own SYCL compiler, called ComputeCpp, but we're going to be focusing on this one.
B
Okay, so what a SYCL implementation looks like. The SYCL interface is a C++ template library that developers can use to access the features of SYCL — this box here (that box shouldn't be highlighted; this one here). The language is used from C++ through templates, through standard C++ features, so it looks like normal code, like you're using a normal library. And the same interface is used for both the host and the device code.
B
Yes, this is important: it's all C++. The host is generally the CPU, and it is used to dispatch the parallel execution of kernels. The host is your standard serial CPU execution model, I suppose.
B
That's a bit of a random way of saying it, but yes. The device, then, is your accelerator, your offload processor — a GPU, an FPGA, whatever that might be. And the runtime library schedules and executes work.
B
Okay, so this library here: it loads kernels and dispatches them to whatever offload device you're using; the runtime schedules which kernels should be dispatched at which time, and so on; and it tracks dependencies between kernels, and between kernels and operations like memcopies, that kind of thing.
B
Okay, so the host device. This was something that was in the SYCL spec for SYCL 1.2; it's no longer actually in the spec, but it's still there for DPC++, at least. We can decide to run our code using the host device, which is a CPU device, so you can interchange these devices, which is great for debugging — this is really, really useful. But it's provided by the implementation, or not; it's not necessarily a core language feature anymore.
B
So it's a great thing for debugging, but once your code is working on the host, that doesn't necessarily mean it's going to work on an offload device; you sometimes need to also debug on the device in question.
B
Okay, and then the backend interface. This is where the SYCL runtime calls down into a particular backend in order to execute on a particular device. For DPC++, this is called the plugin interface, and it talks to the CUDA driver: DPC++ uses the CUDA driver API, and the plugin interface builds kernels, sends them off, awaits the responses, and that kind of thing.
B
In the case of the CUDA backend, it generates PTX, which is the CUDA assembly, and it also generates a CUDA binary, which is put into the final executable. We'll cover this again later.
B
Okay, so the standard C++ compilation model, when you compile your normal code — this is glossing over a few things, but in general: you take your C++ source and compile it into an object; then you link it, possibly with multiple other objects or maybe static libraries, into an executable; and then you give it to the CPU.
B
Whatever runs it. Okay, so the question is: how do we do this for both the CPU and an offload device?
B
Usually, these kernels are function objects — the kernels are function objects or lambda expressions. By the way, this is phrased maybe unfortunately, but it's not a std::function; forget std::function. It's a function object or a lambda expression; it can't be a std::function.
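The point about kernels being function objects or lambdas — but never std::function — can be seen in plain standard C++. SYCL's parallel_for is a template over the callable's concrete type, schematically like this (a simplified stand-in, not the real SYCL API):

```cpp
// Schematic stand-in for a SYCL-style entry point: a template over the
// callable's concrete type, which is what lets a device compiler see the
// kernel body at compile time. A std::function would type-erase it.
template <typename KernelFunc>
int invoke_kernel(KernelFunc f, int i) {
  return f(i);
}

// A function object (functor) playing the role of a kernel.
struct Doubler {
  int operator()(int i) const { return 2 * i; }
};
```

Both a functor instance and a lambda can be handed to `invoke_kernel`, because each has its own concrete type that the template captures.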
B
Okay, so let's see how this works. The SYCL device compiler produces device IR, which in the case of CUDA is PTX. Actually, the default is SPIR-V — a device IR that can be consumed by OpenCL devices — but when we're dealing with CUDA, as we are today, it produces PTX, if we tell it to. We need to tell it to, as we'll see.
B
Then this device IR is linked with the CPU object, so you have the device IR embedded within the executable binary; you can dispatch that, and it'll be split up and run at run time. The idea is that you have these kind of independent compilation streams, and it's only when they're linked that you need to bring them back together. So, now: DPC++ with CUDA.
B
I'm not sure if everyone has had the chance to load the module on Perlmutter, or whether you have access, but the first thing we can do is just check that our install is working. So I'm going to go over to my Perlmutter tab.
B
Here you can see — I'm not sure if that's big enough — sycl-ls. This is the first thing we should use when we're checking what devices are available, and whether the DPC++ installation is working. Actually, sorry—
B
Oh, sorry — these are the commands that we need to run in order to load the module.
B
These are the devices that we can choose from. You can see there are a few different entries — it's like a device triple, I suppose: we have ext_oneapi_cuda, gpu, and then 0, as in the first device of that particular backend, for the first two entries. And you have host, which is the host device — the first host; some SYCL implementations have multiple hosts, but yeah.
B
We don't need to worry about that. Okay, so, to use the DPC++ compiler: by default, using just -fsycl, we compile device code to SPIR-V. We're not necessarily interested in that at the moment; we're interested in compiling for the CUDA backend, which will generate PTX and also a CUDA binary. So if we do this — let me copy that, very nicely — let's see, and then I'll just do test.
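The compile step being demonstrated looks roughly like this. The flag spellings follow the DPC++ CUDA-backend documentation of that era and may differ slightly from whatever the Perlmutter module wraps:

```shell
# Default: -fsycl alone compiles the device code to SPIR-V
clang++ -fsycl test.cpp -o test

# Target the CUDA backend instead, producing PTX plus a CUDA binary
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda test.cpp -o test
```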
B
As I said earlier, compiling is okay, but running code should be done using sbatch — though I'm not sure if I'm allowed to do that, if I'm above the law here. Maybe. Yes, so we're running on this device. Okay. And actually, a really neat thing that we can do to change the device that's selected by the runtime is to use SYCL_DEVICE_FILTER.
B
That's great. Okay, let's see what it looks like if we compile and forget this flag, so that we compile for SPIR-V, which cannot be consumed by our CUDA device — let's just see what kind of error we get. Okay. So we know that, actually, when you run this without specifying which device should be chosen, it seems that the default is the CUDA device.
B
If I want it to be sm_80, I need to specify it. Okay, so, what's happening under the hood? Essentially, it's the same thing: you have these parallel compilation streams — this is obviously DPC++ for CUDA. You have your CPU object, which gets produced by your host compiler, and then the device compiler produces PTX assembly. And, for ahead-of-time compilation—
B
—the PTX assembler is also invoked, to create a CUDA object file. Then the CPU object, the PTX assembly, and the CUDA object are lumped into this final fat binary. And this is great: essentially, when we're running our fat binary, the runtime — or at least the CUDA driver — asks: is this PTX compatible—
B
Sorry — is this CUDA object compatible with my compute capability? And if not, the PTX assembly will be JIT-compiled into an appropriate binary at runtime. This essentially means we don't really need to worry about the arch flags that much. So, what was happening earlier, when I was running without having specified the arch:
B
There was a device binary passed along; the CUDA driver took it and said: oh, actually, I can't use this — but it doesn't matter, because I have the PTX as well. I'm going to compile that for sm_80, which is what's needed for the A100, and then I can still run the code. And the PTX JIT compiler uses a cache as well.
B
So the first time that you run it, the JIT will happen; but the second time, third time, whatever, you'll just use the binary that was cached by the PTX JIT compiler.
B
The JIT is given a finite amount of time to compile, because essentially you want your code to start running quickly, whereas the offline PTX assembler has, theoretically, an unbounded amount of time that it can use to do various optimizations and whatever. So it might happen that offline compilation — which is guaranteed by using the correct arch flags — gives you slightly more optimal code out of the PTX assembler. But these differences are very marginal, so you'd need to profile.
B
You
need
to
test
a
little
bit
if
you
wanted
to
test.
If
you
wanted
to
test
the
ptxj
compiler
versus
offline
ptx
assembler,
you
would
just
pass
the
correct
arch
flags
for
offline
compilation
or
the
incorrect
arch
flags
for
jit
compilation,
which
is
a
bit
a
bit
kind
of
hacky,
but
that's
that's
how
it
works:
yeah,
okay,
so
yeah.
B
There
are
a
lot
of
things
going
on
here
like
this
is
kind
of
an
abstraction
as
well
we're
just
focusing
on
a
few
elements,
but
you
can
query
exactly
what's
happening
under
the
hood
by
passing
the
hash
flag.
So
let's
have
a
look
at
that,
so
if
I
just
do
climb,
let's
go.
B
Certainly in our work, when we're trying to figure out what's happening, or what's going wrong, in various compilation processes, we use this all the time. For day-to-day use, if you're just implementing things in SYCL, it's maybe not that useful, but it's good to know that it's there. For instance, if you look at the PTX assembler — this is where the PTX assembler is invoked — we can see that it's for sm_50; okay, because I didn't specify a GPU arch.
B
Okay, another really useful thing is getting the intermediate files from the compilation process, which can be really handy, especially if you're trying to, say, compare the PTX generated by DPC++ against the PTX generated by nvcc. This is really neat. The way to do it is to use -save-temps, and it needs to be called from within an empty directory. So let's have a look at that.
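The intermediate-file trick, as a sketch — run from an empty directory, as he notes; the exact file names produced vary by compiler version:

```shell
mkdir tmp && cd tmp
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda -save-temps ../test.cpp
ls   # preprocessed sources, LLVM bitcode, host assembly, device PTX (.s), fatbin, a.out
```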
B
Okay — I think what's at the top there was from previously. Okay, so I'll just make a temp directory.
B
Yes, you need to make an empty directory and cd into it, and then, essentially, I'll set it up again. Okay, so we're going to save-temps.
B
Okay, so we can see — yeah, we got our a.out, and we also got all of our intermediate files. For the host code, we got our preprocessed file and our LLVM bitcode — bitcode or bytecode, I always mix those up — and we also have our x86 assembly.
B
And that's just for the host code — we have all of this for the host code: footers, headers, and so on, fat bins. Very nice. And then for the CUDA backend, we have our CUDA object and our CUDA bitcode. So we can see that the PTX has actually passed through this LLVM layer beforehand: LLVM optimizations are done on the device code as well, before it's actually turned into PTX, and then the PTX gets optimized again.
B
So, theoretically, it's being optimized by two separate things, so theoretically it might be more optimal — who knows. We're also interested in our PTX, which is usually in a .s file, so let's open this up and see what's there.
B
Now, it's not every day that you'd necessarily be using this. Okay, so here's our PTX target. It can be really useful if you're trying to benchmark things — compare performance between, say, CUDA and DPC++. For normal use it may be overkill, but it's good to know that you have these things as well.
B
If you think there are bugs due to a particular process in the backend, you can take your PTX file, pass it to the PTX assembler manually, and figure out whether something is working or not working, etc. Okay, so, specifying the device at runtime: this is just using our SYCL device filter, as we saw. And that's everything for this slide — questions?
H
Can you repeat the flag for requesting a specific CUDA architecture?
B
Yes, of course. Let's just redo our previous example. Our last PTX was for sm_50, because we didn't specify the architecture; but if we pass -Xsycl-target-backend with --cuda-gpu-arch=sm_80 — okay, we'll save-temps, and we'll see what the PTX is like. It should say sm_80, we'd imagine; let's just see. Okay, so this is a whole new set of files.
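Put together, the arch-specific command he's repeating looks roughly like this. The flag spelling follows the DPC++ CUDA-backend docs of the time; check the documentation of your installed version:

```shell
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda \
        -Xsycl-target-backend --cuda-gpu-arch=sm_80 \
        test.cpp -o test   # ahead-of-time device binary for the A100 (sm_80)
```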
B
So, again: whether we want our device binary to be produced ahead of time, or at runtime by the JIT compiler — that's up to you. I think the performance differences would be marginal, to say the least.
B
Okay, any other questions?
H
I have one stupid question: can SYCL be used with Fortran?
B
Not a stupid question at all. SYCL can't currently be used with Fortran. SYCL relies on C++ language features, so it sits completely on top of C++ — it's a part of C++. The only thing that makes it different is essentially the backend, how it interacts with whatever backend you might be targeting. So it is a purely C++ language; there isn't a Fortran API for SYCL. But, maybe—
G
That's something we've kind of talked about before. While SYCL doesn't have any kind of direct interoperability with Fortran, SYCL does provide what's called a host task, which is a feature where you can run arbitrary C++ code within the SYCL scheduling DAG; and from there you could potentially interoperate with Fortran code through standard C-ABI interoperability.
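A rough sketch of the host-task route just described, assuming a Fortran routine exposed through the standard C ABI — the routine name and signature here are invented for illustration:

```cpp
#include <sycl/sycl.hpp>

// Hypothetical Fortran routine exposed via bind(c) as a C symbol.
extern "C" void fortran_saxpy(int n, float a, const float *x, float *y);

void call_fortran_from_dag(sycl::queue &q, int n, float a,
                           const float *x, float *y) {
  q.submit([&](sycl::handler &cgh) {
    // host_task runs arbitrary host-side C++ (here, a C-ABI call into
    // Fortran) as a node inside the SYCL scheduling DAG, so it is ordered
    // against the kernels and copies it depends on.
    cgh.host_task([=] { fortran_saxpy(n, a, x, y); });
  });
}
```

This requires a SYCL 2020 implementation with host-task support (DPC++ has it) and is only a sketch of the interoperation pattern, not workshop code.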
I
Yes, I have essentially two questions. First: if I have an existing, pure C++ application, could I just compile it with, let's say, the SYCL compiler and hope it will just work — even though it will only work on the CPU?
B
If you have CPU code, you're just using Clang — you're using the LLVM infrastructure — so you don't need to use SYCL, necessarily. You can just use it as a C++ compiler and it'll work; but it'll work purely on the CPU, in the exact same way that it would work with any other compiler.
I
That's fine. It's just that, when you want to port stuff, you want first to start with something that works without any changes, and move on from there. So I was just wondering if I can do that, and it looks like the answer is yes — which is great. Absolutely, absolutely. Maybe this was partially answered before, but if I wanted to interface with something like cuFFT, is that doable, reasonable?
B
Absolutely, yeah. SYCL — and DPC++ as well — offers a lot of interoperability APIs whereby you can essentially write native CUDA code. We're not going to be covering that today; it's a little bit out of scope. But it's something that's completely possible with SYCL, and with DPC++ as well: you can write completely native CUDA code in these kinds of interop tasks, which is very natural.
B
So this is a really easy way to port CUDA code into SYCL code, and then maybe slowly modify it towards more SYCL-leaning things.
C
Yeah, I can quickly point you to an example of how that's done — there are some examples that we could point you to.
G
To follow on a little bit from that: as well as DPC++, oneAPI also provides a series of libraries — things like oneDNN and oneMKL — and I think there's one coming for FFT. We're also working to try and support these with the CUDA backend as well, so the cuFFT support isn't available yet, but that's on the roadmap for the future.
E
Are there plans to support multiple GPU backends there? So, if you're wrapping — would it also work with oneAPI MKL? I guess you don't have an AMD backend.
G
Yeah, so these libraries all have backends for the Intel platforms, and we're working on supporting them for the CUDA backend. So far there's a good amount of support for oneMKL and oneDNN — we're actually in the process of adding additional support for oneDNN — and then we're keen to support some of the other libraries, like cuFFT and cuDPL; sorry, oneDPL.
G
That's planned for the future, but it's not available just yet, and obviously—
E
Okay. I've actually written a thin wrapper around all of these — it also supports AMD, so it covers the Intel, CUDA, and AMD ROCm platforms — for the subset of BLAS and FFT that we needed for our applications. So I guess if anyone else needs something now, you could take a look at what we've done; it's on GitHub under g-tensor.
B
So, currently — theoretically, a SYCL queue doesn't necessarily have to have a relationship to a CUDA stream, and in certain implementations, notably hipSYCL, a queue maps to a collection of streams, meaning that you can essentially have a queue executing concurrently, which is the goal. But currently it is actually mapping directly to a single stream. That is liable to change, though.
C
Sorry — I've posted the links in the Slack channel, so everyone should be able to get them there. But I'll let you talk through how this example works.
B
Yes. So you need to clone the SYCL Academy repo and make sure that you're on the Perlmutter workshop branch, and the code exercises are in there.
B
Let's see — Exercise 1. Okay, so essentially we just want to make sure that we're able to compile with the SYCL headers. We want to include the header and default-construct a queue — we don't need to think too hard about that at the moment — then see what device is associated with that queue, and get the info for the device's name (this is a string), and then maybe print that out or something, just so that we know what device we have chosen.
B
We have the solutions too. I would recommend not looking at the solutions before trying the exercise yourself — but it's a free world, so yeah. This is just a simple example of what you might write: default-construct a queue, get the device, get the device name, and then print the chosen device.
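The solution being scrolled through amounts to something like this — assuming the DPC++ headers; the print format is my own:

```cpp
#include <sycl/sycl.hpp>
#include <iostream>
#include <string>

int main() {
  sycl::queue q;  // default-constructed: the runtime picks a device

  // query which device the queue was bound to, and print its name
  sycl::device dev = q.get_device();
  std::string name = dev.get_info<sycl::info::device::name>();
  std::cout << "Chosen device: " << name << "\n";
}
```

Compiled and run on Perlmutter with the workshop module, this prints whichever device the runtime (or SYCL_DEVICE_FILTER) selected.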
B
If people are having issues — obviously this is very simple code, so if you're having issues with this sort of code, we need to make sure that we figure them out. With a.out — okay, sorry — yeah: we need to make sure that we figure out any issues now, before we get on to more complex things.
B
Yeah, so there's the SYCL spec, and also the oneAPI spec as well. I find that the SYCL spec is very good.
B
So I think someone was potentially having an issue with a CUDA error: out of memory. I'm not sure if this has to do with not using a batch script — this is the kind of error that you might be susceptible to if many people are trying to—
B
—it is something that happens. So, do we have a reference sbatch script somewhere in the Slack that we could direct people to?
C
On the original deck that was shared — maybe in our full script. Let me dig it out; I think I can find something.
H
Okay, so it's not like you have to manually destroy anything — call something like cudaDestroy — when using SYCL? No?
B
No, no, no. This is C++ again — C++ has its own destructors. The C++ paradigm is a good rule of thumb: if something usually works in C++, the same goes for SYCL.
B
Cool, okay, yeah. We might crack on to the next exercise.
B
So, query your sycl-ls and you'll see what's available. The way that I think is nicest to specify the runtime device — sorry, the device filter — is using SYCL_DEVICE_FILTER, and we can say host. If you use essentially any of the words that appear in any of these triples, then it'll select the one that matches; so in this case: running on the SYCL host device — success.
B
Okay — and that was an invalid binary there, because I compiled for SPIR-V; I would recommend using SYCL_DEVICE_FILTER=cuda. There are other ways of constructing queues, where you statically choose what kind of device you want — and that's good as well, but it's outside the scope of today's workshop. There are lots of materials online where you can see how to do it; it's very, very simple.
F
Sorry to interrupt you — so here we need to specify ext_oneapi_cuda, yeah?
B
Yeah — you don't need to type the full name, essentially. I'm not sure exactly how the matching works, but any word that matches here in this triple will specify the right device. Okay, you could also do SYCL_DEVICE_FILTER=gpu.
F
That's good for people to know, because I think most users probably don't know that they need to specify some device filter before they run the binary. So, yeah — I think it's good for people to realize that we need to have a filter set in order to run the program.
B
The reason is — it's not actually essential; you can decide in the code what device to run on. But essentially this is, as I see it, a benefit of SYCL, because you can determine which device things run on at runtime: you don't need to change any code to run it on different devices.
B
I think this is a trick, if you like — something that will enhance the usability of your code — whereas you can also hard-code these things in your code. Okay, just for instance, I'll go to — sorry.
B
You can essentially tell the queue what device you want it to be constructed with; so here we're going to use a GPU.
B
Okay, so let's see — we'll do a quick compile, and maybe we'll see if there's something decent.
B
In fact, if you use the standard default constructor for a queue, then you have far more flexibility at runtime, because you can choose things: you can choose whether to run on this device or that device, whereas this will only allow you to run on a GPU. That might be desirable — it depends on what you're doing. If you're trying to prototype or debug, it's very useful, very dynamic, to be able to swap between the host and other devices very quickly. And there are lots of different selectors.
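Hard-coding the device at queue-construction time, as shown here, looks roughly like this. gpu_selector was the spelling used by DPC++ at the time; SYCL 2020 final renames it gpu_selector_v:

```cpp
#include <sycl/sycl.hpp>

int main() {
  // statically request a GPU: construction throws if no GPU is available
  sycl::queue gpu_q{sycl::gpu_selector{}};

  // the default constructor leaves the device choice to the runtime
  // (e.g. via SYCL_DEVICE_FILTER), which is handier for prototyping
  sycl::queue flexible_q;
}
```

The trade-off matches what's said in the session: the selector pins you to one device class, while the default constructor keeps the choice open until run time.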
F
Oh, that's a good idea. So you recommend that we just specify a generic queue — a queue without specifying the selector — and then we can use the device filter to run the program on a specific device?
B
As a user, this is how I use SYCL, and I find that this is really, really helpful — but it's up to the user, I think. Okay, anyway — you're welcome.