From YouTube: Data Parallel C++ (DPC++) Programming Model
Description
Abhishek Bagusetty (Argonne)
Data Parallel C++ (DPC++) Programming Model
I'm from the Performance Engineering Group at the Argonne Leadership Computing Facility, and I'd like to thank the organizers for giving me this opportunity to talk about DPC++ in a different setting, for Perlmutter. First I would like to acknowledge Jeff Larkin and Johannes, who have laid the groundwork for much of my talk. So without further ado, I'll talk about the Data Parallel C++ programming model, specifically for NVIDIA GPUs, and the more practical case here of Perlmutter.
I'll start on a different note here, which is SYCL. Many of you have heard this keyword: it's a specification, a language specification from the Khronos Group. The Khronos Group handles the specifications for OpenCL, OpenGL, and many other models, and SYCL is one of those. I would like to emphasize that SYCL is not a programming model but a language specification, and I'll get to how SYCL and DPC++ are related to each other in a bit. Long story short, SYCL is, as I said, a language specification, and it carries features with naming heuristics similar to OpenCL. So anyone familiar with OpenCL might see exactly the same keywords appear in SYCL.
The salient feature of SYCL is that it's a C++ single-source programming model, which means that the host and the device code coexist. And when I say device code, I'm not only referring to GPUs; it could be anything: GPUs, FPGAs, DSPs, etc. So, long story short, the single C++ source code runs on the host and the devices. And then there is the memory model, which refers to USM and the buffer model.
I'll give brief examples later in this session about the differences between these memory models. The biggest difference between the two is that one offers a lot more control over the memory transfers, and the other just does that job for you. That's the stark difference. And like any other GPGPU programming model — take CUDA, HIP, OpenCL — I mean, this is the fundamental feature that one would ask for.
It carries an asynchronous programming model, where you just overlap compute, copy, and host operations to maximize performance and reduce the time to solution. Coming to the most important point: it offers portability. As I said, it is a single C++ source code which can run on host and GPU devices — and it can run on any hardware. It was initially designed for Intel GPUs, which was clearly the motivation, but it can run on NVIDIA and AMD GPUs as well.
And let's not forget about productivity: how many lines of code do you write, and what is the biggest performance gap that you see against the native programming model, for example CUDA? I'll briefly touch on those points.
Moving on — coming to the title of the talk, Data Parallel C++: this is nothing but Intel's oneAPI implementation of SYCL. As I said, SYCL is a language specification, and vendors have the freedom to take that specification, implement it, and provide tools for us. So DPC++ is a programming model which is an implementation of SYCL; another way to view it is as just a mixture of C++ and the SYCL standard with some extensions.
These are the three components that form DPC++. As you've heard in the morning, C++ has been evolving into a modern parallel programming language since the C++17 ISO standard, and one of the fundamental requirements of SYCL is C++17 compliance. As I was saying earlier, it is built on modern C++, and the most important feature is that it is a cross-architecture standard.
Moving forward: I just talked about C++ and SYCL, but there is another piece, called extensions. These are the extensions that the vendors implement for anything related to productivity, ease of use, or performance. And some of these extensions were, fortunately, adopted by the SYCL standard recently.
So the goal is that all these implementers of the SYCL standard develop their extensions, and if they prove to be useful to the open community, the LLVM community adopts them — the goal is to open-source them to LLVM upstream. And, more importantly, these extensions are closely observed by the SYCL and Khronos working groups.
So it's quite a collaborative effort and a good feedback loop. As I was just saying, SYCL is a portable programming model, with a specification based on the C++17 standard, backed up heavily by the industry.
It's open source, and it's a single-source programming model. There are several libraries built on top of SYCL, based on C++. This is a flowchart of the SYCL compilation; I'll keep it simple: it just has two compilation workflows, one for the device and one for the CPU. If it goes through the device path, the SYCL compiler chooses whatever backends are available — OpenCL targeting all these devices, or other backends.
You could choose CUDA, HIP — anything that targets those devices — and there is another path, traditionally, for the CPUs as well. So this is just a simple compilation workflow if you take a SYCL programming model. I'll just talk about the compiler vendors that are active in this space, and mostly focus on the leftmost one, Intel's DPC++ implementation. There are other active contributors, such as Codeplay and hipSYCL, on the DPC++ side.
You just have several devices, each going through its own plugin: Intel GPUs go through the Level Zero plugin, NVIDIA through NVPTX, and AMD through GCN. So DPC++ just provides a portability layer for SYCL and then targets all these devices.
So what's the story with SYCL at NERSC? There has been a collaboration between ALCF, NERSC, and Codeplay to enable SYCL on NVIDIA A100 GPUs.
The initial scope of the work is largely completed, but there is also a good bit of tracking for the libraries as well, because any scientific endeavor requires good support from the compilers and the libraries to have a seamless portability story.
So you could check out this module here, and SYCL at NERSC: there was a training event that happened in March, and there is pretty good training material that one could look at — it's self-evolving — so feel free to check it out. I'll just talk about some of the heuristics of SYCL; for people who are familiar with CUDA or HIP, it has very similar equivalents.
I'll talk about SYCL queues and contexts as abstractions. SYCL queues are just a mechanism to provide work to a device — think of it that way — and SYCL contexts are nothing but like a CUDA context, which everyone overlooks; the same case applies with SYCL. A SYCL queue is like a handle that you submit a job to, and it then dispatches the work to the host or device. To be brief, a SYCL context is nothing but like a CUDA context.
Loosely speaking, these contexts provide a mechanism for resource isolation or sharing — whether you want to share the memory with the next GPU or not — and queues are nothing but like CUDA streams, which most of you are hopefully familiar with; they provide an asynchronous mechanism with the host.
CUDA streams are only in-order, while SYCL queues are both in-order and out-of-order, so you could choose either. And if you choose an in-order SYCL queue, it exactly mimics the behavior of a CUDA stream, which is first in, first out. So SYCL has very one-to-one mappings with CUDA.
As Johannes was talking about some of the active development that goes into the compilers: yes, you can build your own compiler with ease. These are the instructions to build on Perlmutter, and there is already a module that one could use as well. If you just clone and build with the CMake instructions, you will find the compiler — it's as simple as that. But it just takes a while to build, so just be careful on that.
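As a hedged sketch of what those build instructions typically look like (following the upstream intel/llvm getting-started flow; exact branches, flags, and module names on Perlmutter may differ):

```shell
# Clone the open-source DPC++ compiler and configure it with the
# CUDA (NVPTX) backend enabled, then build. This is a long build.
git clone -b sycl https://github.com/intel/llvm.git
cd llvm
python ./buildbot/configure.py --cuda
python ./buildbot/compile.py
```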
As I said, if you see these keywords, they have very familiar naming with respect to OpenCL as well, and people who are familiar with CUDA will find it very easy to adapt to SYCL too. And what is the motivation behind adapting to SYCL? It's a portable programming model, right? I'll just briefly touch on the subgroups here.
Sub-groups are nothing but warps in CUDA. For people who are very familiar with CUDA warps and what they offer: sub-groups offer exactly the same features. So SYCL sub-groups map to CUDA warps on the NVIDIA side, and to wavefronts on the AMD side.
The memory model is again very much similar to CUDA as well: registers, shared memory, global memory — if you're familiar with all these terms, it's exactly the same, so the learning curve is quite small when you venture into SYCL. I'll just show a simple snapshot of how you allocate memory in CUDA versus how you allocate memory in SYCL. The most important thing here is: all you need to change is the selector to the GPU, which then runs the entire parallel code on the GPU.
So if you replace the GPU selector with a CPU selector, it would run on a CPU. That's the way it targets the different hardware, and you could do it in a compile-time way or in a runtime fashion as well.
This is a very textbook example of SYCL that I just want to show you — what the workflow looks like — and I want to briefly touch on the buffer memory model that I introduced at the very start. So you have the standard headers, and so on.
Moving to the USM memory model, which offers a slightly more familiar feel, because buffers are quite complex and involve too much code. If you look at the USM model, you see all these pointers, which you are familiar with: the USM model is nothing but a pointer-based model. So you have the data structures: allocate memory, memcpy, launch the kernel, copy the results back, and print the results. This workflow is very similar to CUDA and any other GPGPU programming model. So that is the memory model.
I'll just skip this for the sake of time and talk about performance benchmarks. These are the benchmarks that were carried out in collaboration with Codeplay on Perlmutter.
The question is: how does CUDA compare with SYCL? As you can see, this is the BabelStream benchmark, which is heavily used these days on different hardware and different programming models; in most of the cases, if you compare blue and orange, it's almost very comparable. And LULESH is one of the mini-apps that you just heard about from Jeff Larkin; you could see the difference in performance between CUDA and SYCL there. This is a quite old benchmark, so it would be starkly improved — and similarly with RSBench and the others.
These are some of the other benchmark assessments; as you can see, SYCL sometimes beats CUDA performance, and otherwise shows very similar performance. So these are some of the benchmarks that we carried out — and I should admit that these are quite old and need to be rerun; the story would change quite a lot, because there have been quite a bit of improvements coming on the practical side.
I don't know what's happening here — okay. The question is: how do I port an existing CUDA code to SYCL? There is an open-source tool which ports a CUDA project — I'm not talking about a single file or a kernel, but the entire project — to SYCL, which could then be deployed on different hardware. This is open source; feel free to check this out.
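The transcript does not name the tool; it is most likely SYCLomatic (open-sourced from the Intel DPC++ Compatibility Tool, `dpct`). As a hypothetical invocation, with illustrative paths:

```shell
# Migrate a whole CUDA project tree to SYCL (paths are placeholders).
dpct --in-root=./cuda_src --out-root=./sycl_src ./cuda_src/*.cu
```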
There is an additional resource if you want to see what the equivalents of CUDA constructs look like in SYCL. The obvious practical question is: if I have math libraries, what's the story behind that? oneMKL comes as part of DPC++ — oneMKL is nothing but the oneAPI Math Kernel Library. This works on multiple backends, so for NVIDIA you could use the oneMKL APIs, but it just piggybacks on the standard CUDA libraries, and the same goes for AMD. So it's the same performance; there is no performance difference or performance hit that you observe.
So, okay: these benchmarks were performed with the USM memory model, because the buffer model is quite outdated by now, if I should say so. And yes, I do agree that it's very bad to not show plots that start at zero, and I apologize for that, for this LULESH performance plot.
Yeah, yes, I agree — the discussion is that it would have been a much better showpiece of performance if it started at zero, because much of these are quite comparable for LULESH. Any other questions? Sorry about the background noise.