From YouTube: NESAP Project: WDMApp
Description
Aaron Scheinberg
NESAP Project: WDMApp
Yes, so this is a project which is an exascale project, as well as XGC standing alone, and we've been doing a lot of development and already some science work on Perlmutter with a great team from all over the country.
It's a gyrokinetic code, which means that we're taking the 6D plasma physics and reducing it to five dimensions using a technique called gyro-averaging. It's a particle-in-cell code on an unstructured 2D grid in the poloidal cross section, which you can see on the right here. So we have this unstructured grid, and that's what really enables us to do realistic geometry.
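As a quick aside on what that reduction means, the standard gyrokinetic bookkeeping (not anything specific to XGC's exact formulation) averages over the fast gyration angle, so the six particle coordinates collapse to five gyrocenter coordinates:

```latex
% 6D particle phase space -> 5D gyrocenter phase space (gyrophase averaged out)
f(\mathbf{x}, \mathbf{v}, t)
  \;\longrightarrow\;
\bar{f}(\mathbf{X}, v_{\parallel}, \mu, t),
\qquad
\mu = \frac{m v_{\perp}^{2}}{2B}
```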
Then, in the toroidal dimension, we use a structured grid; it's a series of planes. We use domain decomposition, where there are these toroidal slices, and each MPI rank is in charge of a subset of the grid.
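A minimal sketch of what that kind of plane-based decomposition could look like, assuming a simple rank-to-plane mapping (all names and the splitting logic here are hypothetical, not XGC's actual code):

```cpp
// Hypothetical sketch of plane-based domain decomposition (not XGC's actual code).
// Each MPI rank is assigned one toroidal plane and a contiguous range of
// unstructured-grid nodes within that plane.
#include <mpi.h>

struct Domain {
  int plane;        // which toroidal plane this rank works on
  int first_node;   // first grid node owned in the poloidal plane
  int last_node;    // one past the last grid node owned
};

Domain decompose(int rank, int nranks, int nplanes, int nodes_per_plane) {
  // Ranks are grouped by plane (assumes nranks is divisible by nplanes);
  // within a plane they split the unstructured grid into contiguous chunks.
  int ranks_per_plane = nranks / nplanes;
  Domain d;
  d.plane = rank / ranks_per_plane;
  int local = rank % ranks_per_plane;
  int chunk = (nodes_per_plane + ranks_per_plane - 1) / ranks_per_plane;
  d.first_node = local * chunk;
  d.last_node  = (d.first_node + chunk < nodes_per_plane)
                   ? d.first_node + chunk : nodes_per_plane;
  return d;
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);
  Domain d = decompose(rank, nranks, /*nplanes=*/32, /*nodes_per_plane=*/100000);
  (void)d;  // each rank would now build its local field and particle data
  MPI_Finalize();
}
```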
A
So
these
alternating
fake
colors
show
the
different
MPI
ranks
and
the
basic
algorithm
is
that
you
have
particles
that
scatter
their
charge
onto
the
grid
and
then
using
that
information
you
can
solve
for
the
electric
field
using
the
electric
field,
you
can
push
the
electrons
and
the
ions
to
their
new
position
at
the
next
time.
Step
and
electrons
can
be
pushed
multiple
times
for
each
field
solve
and
because
we
do
this,
what
we're
calling
subcycling
the
electron
push
ends
up
being
our
most
expensive
kernel.
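Spelled out as code, one time step looks roughly like the following, a minimal sketch with stand-in types and stub functions (none of these are XGC's actual interfaces):

```cpp
// Minimal sketch of one particle-in-cell step with electron subcycling, as
// described above. All types and functions here are hypothetical stand-ins.
#include <vector>

struct Particles { std::vector<double> x, v; };   // positions and velocities
struct Field     { std::vector<double> E; };      // electric field on the grid
struct Grid      { std::vector<double> charge; }; // charge density on grid nodes

// Stub implementations so the sketch compiles; real versions do the physics.
void deposit_charge(Grid&, const Particles&) {}
Field solve_field(const Grid& g) { return Field{std::vector<double>(g.charge.size())}; }
void push(Particles&, const Field&, double /*dt*/) {}

void pic_step(Grid& grid, Particles& ions, Particles& electrons,
              int n_subcycles, double dt) {
  // 1) Particles scatter (deposit) their charge onto the unstructured grid.
  deposit_charge(grid, ions);
  deposit_charge(grid, electrons);

  // 2) Use the deposited charge to solve for the electric field.
  Field E = solve_field(grid);

  // 3) Push the ions once per field solve.
  push(ions, E, dt);

  // 4) Subcycle the electrons: several smaller pushes per field solve.
  //    This repetition is why the electron push is the most expensive kernel.
  const double dt_e = dt / n_subcycles;
  for (int s = 0; s < n_subcycles; ++s) {
    push(electrons, E, dt_e);
  }
}
```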
So the Whole Device Model app, or WDMApp, is an Exascale Computing Project, and the idea is that XGC is specialized to work on the edge of a plasma, so we can couple XGC with a core code like GENE or GEM, and that enables whole-device modeling. The vast majority of the time, over 90 percent, is spent in XGC, so its optimization is most critical, and that's what I'm going to be talking about.
Our target architectures are, of course, Perlmutter and Cori and Summit, and then Frontier and Aurora in the future. An interesting thing is that, of course, they all have their different native languages, with CUDA as the NVIDIA language, HIP for Frontier, and SYCL for Aurora.
So we opted to use Kokkos, and also to convert our Fortran code to C++ to enable this. Kokkos is a portability abstraction layer that maps to these different languages and allows us to not have to worry about that. Before 2019, XGC was a Fortran code with three versions of our dominant kernels: we had OpenACC and CUDA Fortran, as well as a vectorized CPU version, and a simpler reference version.
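To give a flavor of what that abstraction looks like, here is a minimal, self-contained Kokkos kernel (illustrative only, not XGC code); the same loop body builds against the CUDA, HIP, SYCL, OpenMP, or serial backend chosen at configure time:

```cpp
// Minimal example of a portable Kokkos kernel (illustrative only, not XGC code).
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 20;
    // Views are allocated in the default execution space's memory (e.g. GPU).
    Kokkos::View<double*> x("x", n), y("y", n);

    // One portable loop body; Kokkos maps it to the native programming model.
    Kokkos::parallel_for("axpy", n, KOKKOS_LAMBDA(const int i) {
      y(i) = 2.0 * x(i) + y(i);
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}
```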
So our approaches have been portability with Kokkos, a major focus on encapsulation and modularity, and a lot of templating. For example, the electron push and the ion push are quite different, but we're able to use the same code for them, and this makes it a lot easier than before to experiment and to swap out different options.
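A rough sketch of that templating idea, with hypothetical species types standing in for the real electron and ion pushers (the physics in advance() is a placeholder, not XGC's equations of motion):

```cpp
// Sketch of templating one push routine over species (hypothetical types).
#include <Kokkos_Core.hpp>

// Each species type supplies its own details via advance().
struct IonPusher {
  KOKKOS_INLINE_FUNCTION
  void advance(double& x, double& v, double E, double dt) const {
    v += qm * E * dt;   // placeholder acceleration
    x += v * dt;
  }
  double qm = 1.0;      // charge-to-mass ratio (illustrative value)
};

struct ElectronPusher {
  KOKKOS_INLINE_FUNCTION
  void advance(double& x, double& v, double E, double dt) const {
    v += qm * E * dt;   // electrons: different q/m, subcycled with smaller dt
    x += v * dt;
  }
  double qm = -1836.0;  // illustrative value only
};

// One templated kernel serves both species; swapping implementations is a
// one-line change at the call site rather than a second copy of the loop.
template <class Pusher>
void push_all(const Pusher& pusher,
              Kokkos::View<double*> x, Kokkos::View<double*> v,
              Kokkos::View<double*> E, double dt) {
  Kokkos::parallel_for("push", x.extent(0), KOKKOS_LAMBDA(const int i) {
    pusher.advance(x(i), v(i), E(i), dt);
  });
}
```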
We also have each of our major code components able to be run independently, and they use the same code base, so they're not copies. That means that any time someone is working on these individual components, or if we give them out to vendors to look at on a specific architecture, they're never outdated, they don't require extra maintenance, and any work done on those kernels can immediately go back and benefit the code. We've also really revamped our testing and our CI, with unit tests, regression tests, run tests, and automated physics testing still in progress.
So, as I said, the original attempt was to keep the computation kernels in Fortran with wrappers, and this was a feasible strategy for CUDA Fortran, but it didn't make much sense for AMD or Intel GPUs. It also limited our functionality and our ability to fully utilize Kokkos.
On the other hand, we decided to opt for this mid-air replacement, because that way we have a single code base, and the maintenance and improvements that are made benefit the current production code as well. We're also able to keep the code up to date with new physics capabilities, and this has already happened: since we've been doing this conversion, we've successfully added electromagnetic physics and multi-species physics.
So one interesting question that has come up, going specifically from Summit to Perlmutter, is particle memory management. On Summit we currently have our particles resident on the host, and every time we want to do an operation on the device, we send all of the particle data to the GPU, run the kernel, and then bring the results back. This means that more particles are possible in general, because we only need to fit one species on the GPU at a time, but it also adds extra communication time. On Perlmutter, there's actually so much more space on the GPU that we can just leave all of our species on the GPU, and that means all of that communication time is eliminated. We've also been looking at asynchronous streaming to try to hide that communication, but on Perlmutter that's not necessary, though we do have to maintain it.
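The two strategies look roughly like this in Kokkos terms, a sketch with placeholder views and a placeholder kernel rather than XGC's actual memory manager:

```cpp
// Sketch of the two particle-memory strategies described above (placeholders only).
#include <Kokkos_Core.hpp>

using HostParticles   = Kokkos::View<double*, Kokkos::HostSpace>;
using DeviceParticles = Kokkos::View<double*>;  // default (device) memory space

// Summit-style staging: particles stay resident on the host; each operation
// copies one species to the GPU, runs the kernel, and copies the result back.
// Only one species has to fit in GPU memory at a time, at the cost of transfers.
void staged_push(HostParticles h_species, DeviceParticles d_buffer) {
  Kokkos::deep_copy(d_buffer, h_species);                  // host -> device
  Kokkos::parallel_for("push_staged", d_buffer.extent(0),
    KOKKOS_LAMBDA(const int i) { d_buffer(i) += 1.0; });   // placeholder work
  Kokkos::deep_copy(h_species, d_buffer);                  // device -> host
}

// Perlmutter-style residency: the GPUs have enough memory to keep every
// species on the device all the time, so the copies above disappear entirely.
void resident_push(DeviceParticles d_species) {
  Kokkos::parallel_for("push_resident", d_species.extent(0),
    KOKKOS_LAMBDA(const int i) { d_species(i) += 1.0; });
}
```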
So this is our Summit performance, going from January 2019 up to, well, time flies, so there's been progress over the last year, but this goes up to November 2021. This is an electrostatic simulation, and I basically just want to show that we have steady improvement, because we've taken more and more of these kernels, shown on the left to indicate that they're on the CPU, and offloaded them to the GPU, and as more of that has happened, we've gotten a big speedup.
We have also done weak scaling studies on the entire Summit machine, and we've made a lot of improvement there up to the present day. But one interesting thing is that our weak scaling actually gets worse because, as our computations are all offloaded, the communication becomes a relatively larger part of our total time. So our particle shift, where we're moving particles between nodes, as well as our electric field interplanar gather, suddenly become a lot more important.
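The reasoning behind that is simple to spell out: offloading shrinks the compute time per step while the communication time stays roughly fixed, so the communication fraction grows even though the communication itself got no slower:

```latex
f_{\text{comm}} \;=\; \frac{T_{\text{comm}}}{T_{\text{comp}} + T_{\text{comm}}}
\;\longrightarrow\; 1
\quad \text{as } T_{\text{comp}} \to 0 \text{ with } T_{\text{comm}} \text{ fixed}
```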
As I said, in the electromagnetic simulation the electron push kernel is less important because it's subcycled fewer times, and so this really changes which kernels most need to be optimized. We're still working on optimizing the electromagnetic version, as opposed to the electrostatic version.
We've also scaled on Perlmutter, up to a thousand Perlmutter nodes, and this is our weak scaling. It's not as good as on Summit, and I think there are a number of ways that we can improve it going forward, with GPU-aware MPI, and also by putting more particles onto each node, basically just packing a larger problem size than we had been doing.
So one option is to run the same simulation, but packed into as few nodes as possible on each machine. I took one simulation that I could run on 64 Cori KNL nodes, and the total memory of 64 Cori KNL nodes is six terabytes. I put this onto 16 Perlmutter nodes, which is a bit more memory available, but it's basically just keeping the nodes packed, using as large a problem size as I could. Just based on the theoretical flops of those two sets of nodes, you would expect that the 16 Perlmutter nodes would be 3.4 times faster than the 64 Cori KNL nodes, and in this particular simulation that I ran, it was actually almost nine times faster.
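Written out, the comparison being made is just the ratio of aggregate theoretical peak flops of the two node sets against the measured runtime ratio (only the symbolic form is mine; the 3.4x and roughly 9x figures are from the run described above):

```latex
S_{\text{expected}} \;=\; \frac{16\,F_{\text{Perlmutter node}}}{64\,F_{\text{Cori KNL node}}} \;\approx\; 3.4,
\qquad
S_{\text{observed}} \;\approx\; 9
```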
We've already been able to do some physics on Perlmutter. Just to prime this: in many tokamaks, the exhaust from the plasma is directed along this separatrix line on the outside, and then this plasma goes and hits the divertor, at the bottom of the tokamak in this case. So the divertor must be prepared to handle a very high amount of heat from this exhaust, and naturally a wider impact area is going to be better, to spread that heat across a wider surface area. This figure shows, as a function of poloidal magnetic field, what the divertor width is expected to be based on some empirical scaling laws.
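For reference, the commonly quoted empirical fit of this kind is the Eich multi-machine scaling, in which the heat-flux width narrows as the poloidal field increases; the coefficients below are the usual published values, not numbers from this talk:

```latex
\lambda_q\,[\mathrm{mm}] \;\approx\; 0.63 \times \left(B_{\text{pol}}\,[\mathrm{T}]\right)^{-1.19}
```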
So this is a visualization of a simulation that we ran on Perlmutter, and it's demonstrating the presence of these structures here, which are called homoclinic tangles. This is not something that I have presented before, so I don't know exactly what a homoclinic tangle is, despite trying to read up on it, but these structures have been observed in the simulation, and the end result is that you have a much more diffuse stream of plasma from the X-point to the divertor than previously thought.
So to conclude, XGC is running on Perlmutter, and it's generally performing pretty well. There's still plenty of work to be done, especially in the electromagnetic mode: we need to offload some more kernels, we need to improve some MPI communication and load balancing, and we need to continue to keep up with new physics additions. But Perlmutter is already enabling XGC simulations that are providing insight on electromagnetic fusion plasma behavior and making predictions for ITER.