Description
Felix Wittwer (NERSC)
Accelerating X-Ray Tracing for Exascale Systems using Kokkos
A
The motion of the atoms is effectively frozen, and so by varying the delay of the laser you can record stop-motion movies of chemical reactions: you can see how all the atoms are moving during the reaction. Now, the main problem with this is that you can hit each crystal only once. So to collect a full data set, we need a continuous stream of crystals, and all in all it takes something like 100,000 crystals for one data set. But the problem is that the crystals are shot into the beam.
A
So we have no control over where the beam hits the crystals, or how the crystals are oriented. But since we need to know the scattering of the X-rays from all directions, we need to collect as much data as possible, because in the end it's a little bit like collecting trading cards.
A
To get a complete set, you need to buy many more card packets than there are distinct cards, because you have no control over what the next card, or the next data shot, will be. And because there's only one such X-ray laser in the US, measurement time is scarce, and it's important to get results quickly and determine: is the collected data useful? Can we move on to the next sample? For this reason, these types of experiments require live feedback.
A
Currently, this instrument generates about 100 images per second, so for a hundred thousand images one data run takes about 15 minutes to collect. Live feedback therefore means we want to know whether this data is useful within something like 10 to 20 minutes, and to analyze these terabytes of data we need a supercomputer, most often Perlmutter at NERSC. But the big problem is that the schedule on the experimental side is totally independent of the schedule at NERSC. So when we get experimental time, Perlmutter might not be available, and then we need to use other sites, for example Frontier at Oak Ridge or Aurora at Argonne.
A
We would have to fragment our code, because each hardware vendor has their own programming model. To avoid this whole nightmare of maintaining three different codes, you need something a bit more abstract: an abstract programming model which can target all these different kinds of hardware. And there are a bunch of options, for example OpenACC, Kokkos, or OpenMP target.
A
Now, Kokkos is a growing C++ programming model, and the nice thing about it, compared to OpenMP or CUDA, is that it doesn't introduce any new syntax: you don't have any pragmas or triple angle brackets or anything like that. Its two central pillars are abstract execution and memory spaces. So instead of writing device kernels for the GPU, in Kokkos you tag a function for an abstract execution space, and only during compilation do you decide, or specify, where this execution space should live and where it should be calculated.
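To make "deciding at compile time" concrete: with Kokkos, the backend is typically selected by build options while the application source stays unchanged. The CMake flags below are the standard Kokkos backend switches; the build directory names are just placeholders.

```shell
# Same application source, three backends, chosen at configure time.
cmake -B build-cuda -DKokkos_ENABLE_CUDA=ON    # NVIDIA GPUs (e.g. Perlmutter)
cmake -B build-hip  -DKokkos_ENABLE_HIP=ON     # AMD GPUs (e.g. Frontier/Crusher)
cmake -B build-omp  -DKokkos_ENABLE_OPENMP=ON  # CPU threads for local testing
```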
A
The way these execution spaces appear in the code itself is via execution patterns. The classical case is that you have some for-loop which you want to parallelize. For example, for us, we want to simulate the scattering of the X-rays, so we want to calculate, for each pixel of the detector, how many X-ray photons will hit it on average. In the original C++ code we just had a for-loop which runs through all the pixels and calculates the scattering.

Now, the scattering at each pixel is independent of all the other pixels, so this can be trivially parallelized, and the idea behind Kokkos is that you have two things: the for-loop has a policy for how it should be parallelized, which in our case just runs through all the pixels, and then you have the body of the for-loop, which tells you what should be calculated. So going from plain C++ to Kokkos means just replacing the for-loop with the Kokkos parallel_for execution pattern; the body of the loop can stay the same. Apart from the parallel_for pattern, which runs all iterations independently, there is also parallel_reduce, where you combine all the different iterations into one result, for example if you want to calculate the sum of the squared entries of a list, and there is also a third pattern available, called parallel_scan, which runs multiple reductions.
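A minimal sketch of the transformation described above. Real Kokkos needs the library and a backend to compile, so this sketch uses tiny serial stand-ins with the same policy-plus-body shape as Kokkos::parallel_for and Kokkos::parallel_reduce (the genuine calls are shown in comments); the per-pixel model is a made-up placeholder, not the actual nanoBragg physics.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Stand-in with the same shape as Kokkos::parallel_for: a range policy
// (iteration count) plus a loop body. The genuine call would be
//   Kokkos::parallel_for("scatter", n_pixels,
//                        KOKKOS_LAMBDA(const int i) { image(i) = ...; });
// and only the backend chosen at compile time decides where the body runs.
template <typename Body>
void parallel_for(std::size_t n, Body body) {
    for (std::size_t i = 0; i < n; ++i) body(i);  // serial here
}

// Stand-in for Kokkos::parallel_reduce: each iteration contributes to one
// accumulator. The genuine call would be
//   Kokkos::parallel_reduce("sumsq", n,
//       KOKKOS_LAMBDA(const int i, double& acc) { acc += v(i) * v(i); },
//       result);
template <typename Body>
double parallel_reduce(std::size_t n, Body body) {
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i) body(i, acc);
    return acc;
}

// Hypothetical per-pixel model; each pixel is independent of the others,
// which is why the loop parallelizes trivially.
double pixel_intensity(std::size_t i) {
    return std::exp(-0.001 * static_cast<double>(i));
}

// The detector simulation: the loop body is identical whether it is driven
// by a plain for-loop or handed to parallel_for.
std::vector<double> simulate_detector(std::size_t n_pixels) {
    std::vector<double> image(n_pixels, 0.0);
    parallel_for(n_pixels, [&](std::size_t i) { image[i] = pixel_intensity(i); });
    return image;
}

// Sum of squared entries, the parallel_reduce example from the talk.
double sum_of_squares(const std::vector<double>& v) {
    return parallel_reduce(v.size(),
                           [&](std::size_t i, double& acc) { acc += v[i] * v[i]; });
}
```

Porting then amounts to swapping the loop driver while the body survives unchanged, which matches the search-and-replace experience described later in the talk.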
A
The second pillar is the memory management, because GPUs have their own memory that is separate from the system memory. For any calculation you want to do, you always need to transfer the data from system memory to GPU memory, and after the calculation is finished you need to transfer the data back into system memory. A lot of CUDA code is just taking care of this memory management: for example, if you want to create an array of zeros in CUDA, you first need to create a pointer, then you need to allocate it, and so on.
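Kokkos wraps this allocation-and-transfer boilerplate in its View abstraction. Below is a toy host-only stand-in that mimics just one property of Kokkos::View, allocation plus zero initialization in a single step; the real Kokkos and raw CUDA counterparts are shown in comments, and the struct name is ours, not a Kokkos type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In raw CUDA, an array of zeros on the device takes several steps:
//   double* d;
//   cudaMalloc(&d, n * sizeof(double));
//   cudaMemset(d, 0, n * sizeof(double));
//   ... run kernel ...
//   cudaMemcpy(host, d, n * sizeof(double), cudaMemcpyDeviceToHost);
// Kokkos collapses allocation into one line and makes transfers explicit:
//   Kokkos::View<double*> a("a", n);         // zero-initialized, lives in the
//                                            // memory space chosen at compile time
//   auto h = Kokkos::create_mirror_view(a);  // matching host-side buffer
//   Kokkos::deep_copy(h, a);                 // device-to-host transfer
//
// Toy stand-in: host memory plays the role of "the" memory space.
struct DeviceArray {
    std::vector<double> data;  // allocated and zero-filled in one step
    explicit DeviceArray(std::size_t n) : data(n, 0.0) {}
    double& operator()(std::size_t i) { return data[i]; }
    std::size_t size() const { return data.size(); }
};
```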
A
We used a small test program called nanoBragg; nanoBragg simulates the diffraction images at the pixel level. This is a massively parallel problem which is well suited for GPUs, because they are more or less designed for exactly this, calculating images, originally for video games. The original code was written in C++; already some years ago it was ported to CUDA and we ran it on NVIDIA GPUs, and the CUDA port resulted in about a 20x speedup. But now, for Perlmutter and in preparation for Frontier, we ported the code to Kokkos.
A
This took us a couple of weeks, with some pitfalls in using Kokkos and getting to know Kokkos, but mostly it was just search and replace, and making sure that we didn't introduce any errors when we replaced the CUDA constructs with the Kokkos constructs. The big question now is, of course: how did this affect the performance? For us, the standard test benchmark is to simulate 100,000 images, and we tested this running on 128 nodes of Perlmutter. The original CUDA code ran in about two and a half minutes, and by switching to Kokkos we, surprisingly enough, got even better performance of just a bit over two minutes. It turns out that the original code used a lot of registers.
A
So we couldn't fully occupy the GPU, and the Kokkos version used just enough fewer registers that we could occupy the GPU more and thus achieve faster calculations. Concerning portability, we ran the same code on Crusher, the Frontier test-bed system at Oak Ridge, which has AMD MI250X GPUs. The same code, just changing the compiler flags as I mentioned before, ran there in 54 seconds. You can't directly compare the numbers between Perlmutter and Crusher, because the nodes are slightly different, but in general we achieved pretty good performance on both systems with the same code. So I would say going to Kokkos was the right way to go, because we need to be able to use different systems.
A
The porting itself was relatively straightforward. There were some peculiarities, which were not really Kokkos's fault; they were more like compiler bugs, or let's say unintended behavior from the compiler. The nice thing is that it's pure C++, so you don't need any fancy new syntax, and you also don't have to worry about syntax highlighting, because it's all C++. One slight problem with Kokkos is that, as it tries to support everything, it sort of has to settle for the smallest common denominator. So, for example, CUDA library support is limited, because it needs to ensure that it runs on all the systems.
A
So if you are, for example, relying on cuFFT, then you would have to implement this on your own, which can be done, but Kokkos can only help you slightly there. On the other hand, even by just staying on NVIDIA we could gain some performance, because the pure CUDA programming inside Kokkos was done by the Kokkos development team and not by us, and they are probably much more capable of doing this. And switching to different hardware was also verified to work.
B
Thank you, Felix, for a very interesting talk on using GPUs and Kokkos for data analysis. We would like to remind the users that tomorrow and the day after there will be more talks on GPUs for data, so please feel free to tune in for those.
B
If there are any questions, please put them in the chat. If not... actually, I do have a question. I was interested to note that the Kokkos performance was faster than the CUDA performance. Is that because maybe you could have written the CUDA a little bit better, and the CUDA would then win? Because I would naively think that Kokkos would be almost as good as, but never better than, CUDA.
A
So I would say you can probably write CUDA code that is as fast as Kokkos code. The question is: is your average CUDA code as good? We have done some profiling and looked into this, and here is what we discovered with this method.
A
So
we
have
a
bunch
of
different
corners
and
we
noticed
that
the
performance
increase,
but
the
difference
between
cooler
like
in
every
single
kernel
caucus
was
faster
even
for
some
simple
kernels,
which
were
just
more
or
less
a
vector
at
and
the
other
right
here
on
the
right.
So
I
don't
know
like
it's,
it's
probably
just
some
minor
setting
or
something
that
if
you
know
Cuda
you
do
this,
but
for
the
average
user.
You
are
not
aware
of
this
or
something
that's.