From YouTube: Intro to SYCL/DPC++ for GPUs
Description
Jeff Hammond from Intel presents a talk on Intro to SYCL/DPC++ for GPUs. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/ Session Chair: Oisín Creaner
A: My name is Jeff Hammond. I work for Intel in the HPC organization, and I'm going to try to give a short, high-level talk. I'm a big fan of Slack, and I'm on the NERSC Slack every day, so if you want more details, I'll put them there so that they're in writing and saved forever. That's maybe easier than doing it verbally, but hopefully there's time.
A: I'm going to talk about SYCL and DPC++, and I will explain the difference: SYCL is the Khronos standard and DPC++ is the Intel implementation, and I have details on this later. I'm focused on GPUs, although neither of these things is specific to GPUs, and I'll show more on that later.
A: I used to be at a DOE lab, I've been around for a little while, and one thing I find particularly interesting is exascale system architecture.
A: Ten years ago, give or take, there were two swim lanes. One of them was sort of Blue Gene-like and the other was basically NVIDIA-like, and that's sort of how the world was oriented. We've since done quite a bit of a rotation in some respects.
A: There's not really a many-core CPU anymore, although that's debatable, because the CPUs in both of these systems will have dozens of cores, but it's not many-core in a Blue Gene or Xeon Phi sense. And of course, neither the exascale system at Oak Ridge nor the one at Argonne is slated to have NVIDIA GPUs, although Perlmutter will have them, and other sites will as well.
A: This is actually kind of a pleasant surprise for those of us who have been advocating for standards and portable programming, because if you were planning to run your exascale application on a many-core CPU or an NVIDIA GPU using something vendor-specific, you are probably sad now. But if you were focused on something that ran on a lot of different machines, like Kokkos or OpenMP, then you're probably fairly happy right now. So I'm going to talk about another option besides Kokkos and OpenMP for portability on some of these systems.
A: So SYCL, like I said, is a Khronos standard, and there is already an ecosystem for it. There are three different implementations that are relevant to GPUs. The first one I'll mention is Intel's Data Parallel C++ compiler.
A: It is based on Clang/LLVM. There's an open-source version on GitHub, and the product we ship in oneAPI is derived from that open source quite directly; the differences are basically different Git hashes compiled on different days of the week. We have some GPU extensions, although as of literally today, nearly all of our GPU extensions are part of the SYCL 2020 provisional standard that was announced today.
A: I'll talk a little bit more about that later. So the DPC++ compiler supports Intel GPUs, CPUs and FPGAs; obviously, we implemented that. Codeplay, which is a company in Edinburgh, contributed support for NVIDIA, and that's available in the open-source version. You have to build it yourself for CUDA licensing reasons; there's nothing we can do about that. There's also Codeplay's ComputeCpp product, which is a different implementation from Intel's. They're both based on Clang/LLVM, but they are different.
A: And that's great because, as folks know, if you've ever debugged a Clang bug or a vendor compiler bug, it's wonderful to compare against GCC and know that there are two different code paths: if they both give the same answer, then maybe your code is wrong, and if they give different answers, maybe one of the compilers is wrong. So Codeplay's compiler supports OpenCL SPIR-V devices, of which there are a few, and I'll show that on a later slide. Their compiler supports our GPUs among others, and they also have a PTX backend for NVIDIA.
A: So both of these first two support NVIDIA and Intel and any other device that supports SPIR-V and OpenCL, which is unfortunately not that many. Heidelberg University produces something called hipSYCL. This is by a fantastic young fellow named Aksel Alpay. It's based on Clang/LLVM, specifically CUDA Clang, and it's related to, or uses the same code as, HIP, which is in the name.
A: Obviously. So hipSYCL has a CPU backend that uses OpenMP, an NVIDIA backend using CUDA Clang, and an AMD backend using the HIP/ROCm stack. So you see here all the GPUs of interest, and actually Arm GPUs are supported too, not that that's particularly relevant unless you're going to do exascale on your cell phone. But there's a nice, healthy ecosystem of support for a lot of GPUs.
A: There's also a compiler called triSYCL, developed by somebody at Xilinx Research, that targets FPGAs rather than GPUs; you can look up the details. I use it on my laptop all the time.
A: I want to talk about performance portability first, and I'm going to cite two different results from IWOCL by researchers not at Intel. It makes it really easy to cite third parties when we compare vendors; it makes my life easier. You can see in the link here there are YouTube videos, the PDF is online (it was free last time I checked), and the code is online, so you have full freedom to explore, reproduce, etc. You can see here on the right the BabelStream triad excerpt from the paper.
A: They had some other numbers, but stream triad is so well known that I figured I'd cite that one. Starting from the right: on the AMD GPU (which generation, etc., is in the paper), you see a negligible difference between OpenCL, SYCL and HIP. On NVIDIA you see a small difference, about five percent for SYCL relative to OpenCL and CUDA. I don't know all the details.
A: There are occasionally runtime and code-generation differences, and sometimes you can soften those differences with minor tweaks if you know how the tools work. On the Intel device, you can see SYCL and OpenCL are essentially identical, and that's because they're basically the same implementation behind the scenes; SYCL is just a prettier front end, as I'll show later. Notice the scale here: it doesn't start at zero, it starts at 60, so you see there's a 25 percent difference on Xeon.
A: This is a compiler bug, basically. We know about it; it'll be fixed at some point. It has to do with the OpenCL compiler doing codegen differently from the OpenMP compiler, not something intrinsic to the programming model, and I suspect that one of the other SYCL compilers for Xeon would not have this issue. So if you care about memory bandwidth, it's nice to know that you can get very, very close to performance portability in bandwidth with a bunch of different programming models.
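
[Editor's note: as a flavor of what such a benchmark kernel looks like, here is a minimal sketch of a BabelStream-style triad in SYCL 2020. This is an illustration only, not the actual BabelStream source; the array length and scalar are made up.]

```cpp
#include <sycl/sycl.hpp>

int main() {
  const size_t n = 1 << 25;   // array length, stream-benchmark style
  const double scalar = 0.4;
  sycl::queue q;              // default device selection

  // USM shared allocations are dereferenceable on host and device
  double* a = sycl::malloc_shared<double>(n, q);
  double* b = sycl::malloc_shared<double>(n, q);
  double* c = sycl::malloc_shared<double>(n, q);
  for (size_t i = 0; i < n; ++i) { b[i] = 1.0; c[i] = 2.0; }

  // Triad: a[i] = b[i] + scalar * c[i]
  q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
    a[i] = b[i] + scalar * c[i];
  }).wait();

  sycl::free(a, q); sycl::free(b, q); sycl::free(c, q);
}
```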
A: This is another paper from IWOCL, about a month or two ago. This is from Argonne, by Brian Homerding and John Tramm. They have results with RAJAPerf, and I think XSBench was in there as well. Again, it's all online; you can grab all of it, and I won't try to go through all the details because they're in the paper. But RAJAPerf is sort of a suite of kernels that are relevant to the NNSA multiphysics workload.
A: You know, hydrodynamics; it also includes some simple stuff, some complicated stuff, a variety of different kernels. One thing I think is interesting about this (I don't even remember which of red and blue is positive and negative for SYCL relative to CUDA) is that, for reasons one would have to diagnose by reading assembly, you see winners and losers in both directions. So it's not monotonically "CUDA always wins and you always pay a price with SYCL."
A: Obviously, I'm sure somebody at NVIDIA could make CUDA always beat SYCL if they played around with tuning the kernels; there are some differences in how the stores are generated in some of these kernels, I think. But this is another example where you can get modest performance portability with interesting code using SYCL on a GPU.
A: So, talking about the different languages: what is Data Parallel C++? This is the Intel compiler implementation, including some extensions. It's based on SYCL 2020, which, like I said, was just released today. SYCL 1.2.1 is the thing that's been around for a while; that's what's broadly implemented today. SYCL 2020 is supported approximately in full by Intel, although I don't know the exact details, and Codeplay is getting there; I don't know their full feature compliance.
A: We tried to standardize all of our extensions, and we were successful with all the GPU extensions. The extension that didn't make it is related to FPGAs, and I don't know whether it was intended to go upstream and was rejected or not. But you can expect standards-compliant SYCL code to always be sufficient on Intel GPUs, and we'll add extensions when users need them.
A: For example, one of the extensions we submitted to SYCL 2020 is called USM, unified shared memory (the same name as in OpenMP 5), which I'll show later. It basically gives you malloc and pointers. We did that because our friends at DOE working on Kokkos and so on said:
A: "We need pointers; we need CUDA-style memory management in order to be compatible with our design." That made sense, and it made sense to the SYCL community, so now that's standardized.
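
[Editor's note: to make the CUDA-style USM pattern concrete, here is a minimal sketch against the SYCL 2020 API. This is an illustration, not code from the talk; the buffer size is made up.]

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  sycl::queue q;
  const size_t n = 1 << 20;
  std::vector<float> host(n, 1.0f);

  // Device allocation: a raw pointer, only dereferenceable on the
  // device, analogous to cudaMalloc
  float* dev = sycl::malloc_device<float>(n, q);

  // Explicit copies, analogous to cudaMemcpy
  q.memcpy(dev, host.data(), n * sizeof(float)).wait();

  // Kernels capture the pointer directly; no buffers or accessors needed
  q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
    dev[i] *= 2.0f;
  }).wait();

  q.memcpy(host.data(), dev, n * sizeof(float)).wait();
  sycl::free(dev, q);
}
```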
A: In our case, for all of our extensions, both the documentation and the implementation are available on GitHub. There's nothing proprietary about them; anybody can re-implement them, and anybody can port the compiler to any other device. So it's open in all the different degrees of openness. Obviously, we want to have everything in Khronos, but we'll also be open in the sense of "hey, it's on GitHub."
A: You can see what we're doing. So why SYCL? OpenCL has been sort of the portable open standard for GPUs and other devices for a long time, and there are some good things about OpenCL: it's portable, and that makes it better than a lot of other things out there. But it's got some warts. People often complain that it's too verbose; it's maybe the difference between MPI and UPC or Coarray Fortran.
A: People don't seem to mind MPI, but they minded OpenCL, for whatever reason. The big reason that OpenCL was not the right place to go, though, is that OpenCL does not have holistic C++ support, and modern C++ is really an essential thing for modern programming. We know NVIDIA is very strong on C++ stuff, and we're big fans, and there's a parallel STL.
A: So there's all this C++ work going on out there, and it really was important to make sure that C++ was a first-class citizen in this model. So SYCL is based on modern C++: the first spec was based on C++11, and it's now based on C++17, so it includes all the good stuff.
A: You know, CTAD and all those other fancy acronyms that people like. If you like TBB or you like the C++ STL, then SYCL has a lot of the same concepts, and it becomes a natural thing to port over.
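
[Editor's note: as a small illustration of that modern-C++ flavor, here is a sketch of the classic buffer-and-accessor style using C++17 class template argument deduction (CTAD), so the buffer and accessor types are deduced rather than spelled out. This is an editor's example, not from the talk.]

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  std::vector<int> v(256, 1);
  sycl::queue q;
  {
    // CTAD: deduces sycl::buffer<int, 1> from the container
    sycl::buffer buf(v);
    q.submit([&](sycl::handler& h) {
      // CTAD: deduces the accessor type and access mode from the tag
      sycl::accessor a(buf, h, sycl::read_write);
      h.parallel_for(sycl::range<1>(v.size()), [=](sycl::id<1> i) {
        a[i] += 1;
      });
    });
  } // leaving scope destroys the buffer, which synchronizes and
    // writes the data back into v
}
```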
A: I will also say that the closest thing to SYCL I've ever found is Kokkos, so if you're comfortable with Kokkos, I think you'll find SYCL quite comfortable as well. One of the ways I discovered SYCL, before Intel was doing DPC++, was actually as part of a sort of cross-industry analysis of modern C++ models, in which Kokkos, RAJA, TBB, PSTL and SYCL all showed up as interesting things that people might want to use. And SYCL is really the first standardized programming model to take on heterogeneous programming with modern C++. Kokkos is certainly open, and I'm a big fan of it, but Kokkos has a different notion of openness than Khronos does, and there are pros and cons of each. We wanted to make sure we were building off an industry standard that would be widely implementable when we were doing our GPU software program.
A: So this is the ecosystem. This is pulled off the Khronos website, showing all the different implementations and all the different devices, and you can see that, depending on drivers and whatnot, pretty much everything is here: all the CPUs, because of LLVM, and all the GPUs that I know about, including Arm and PowerVR (I think that's a GPU).
A: I haven't personally verified anything with Xilinx, but I have run our SYCL compiler on our FPGAs, and I know that works. So that's pretty much everything, and this is cool to me: if you look at all the standards out there, any programming language or programming model for accelerators, SYCL actually has the most device support of anything that exists. You know, OpenMP and OpenACC do not support FPGAs.
A: There are, of course, other software models that are supported on a subset of these things. If you know of anything that supports more hardware than SYCL, please do let me know; I'd be curious to hear about it. So I'm going to go real quick through some syntax. What's my time like? My timer didn't start at zero, so I don't know how far in I am.
A: Okay, I'll go fast. Thank you! I'm not going to belabor this; I'm just going to show OpenCL (verbose, tedious, bad), then SYCL, which has the same level of expressiveness but is still a little bit verbose. Then this is SYCL 2020, so you get USM and eliminate a lot of syntactic bloat.
A: This is pretty nice, but there's one thing here you might want to flatten out, and that's this, which is currently not standardized, but it's literally just syntactic sugar in a header file. The thing is, if you look at this and compare it to something else out there like Kokkos, the syntactic expressiveness is pretty much there. One thing that's nice is that it's fully asynchronous; that's not always available in other models.
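
[Editor's note: as an illustration of that asynchrony, in SYCL each queue submission returns immediately with an event, and dependencies are expressed by chaining events rather than by blocking the host. A minimal sketch, assuming the SYCL 2020 queue shortcuts:]

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;  // out-of-order by default: independent work may overlap
  const size_t n = 1024;
  float* x = sycl::malloc_device<float>(n, q);

  // Both calls return immediately; e1 is passed as a dependency of the
  // second kernel, so the runtime orders them without blocking the host
  sycl::event e1 = q.fill(x, 1.0f, n);
  sycl::event e2 = q.parallel_for(sycl::range<1>(n), e1,
                                  [=](sycl::id<1> i) { x[i] += 1.0f; });

  e2.wait();  // the host only blocks when we explicitly ask it to
  sycl::free(x, q);
}
```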
A: You can see the Kokkos Slack for details on that. And I may have used too many characters here, because I'm not the best C++ programmer. I can give example code anytime; a lot of this stuff is on GitHub somewhere. So yep, that's it. Thank you.
B: Thank you, Jeff. We had a comment there in the chat about the lack of Fortran support, and I was wondering if you could comment on that.
A: So the first thing is: I love Fortran. I've written a lot of Fortran in my life, in NWChem. At Intel, we believe that OpenMP is a fantastic solution for Fortran programmers, as well as for C99 purists and C++03 programmers, and we're supporting OpenMP on our GPUs; I just didn't talk about that here because I had 15 minutes. So option one is: use OpenMP.
A: Fortran doesn't let you glue things onto the language the way C++ does, but that's why we have OpenMP. I have actually talked elsewhere about the merits, one way or the other, of OpenMP and DPC++ for Fortran applications; which one is the priority for people will depend on what their comfort level is and what their requirements are. But I think the standard answer would be: if you have a Fortran application, please keep using Fortran, use OpenMP for GPUs effectively, and you'll be happy.
B: Okay, and one other question before we go to the break: there was a comment that there is OpenACC support for FPGAs. Again, I'm not sure whether that's necessarily a question you have an answer for, but.