From YouTube: Massively Parallel PIC using WARPX
Description
Andrew Myers (CRD, LBL)
Massively Parallel PIC using WARPX
Okay, all right, thank you very much for giving me the opportunity to talk. I assume everyone can see the slides. Yes.
So yeah, I'm going to talk about WarpX today. Before we start, I want to note that I'm going to show some scaling results from Perlmutter and also from Frontier. These are pre-acceptance, so I expect that if we ran them again in a month they might look different. I also want to share a bit about the team.
So there are a lot of people who work on WarpX. This is kind of a snapshot of what the current team looks like. It's a multi-disciplinary team; there are people with physics, applied math, and computer science backgrounds. Most of us work here at Berkeley Lab, but there are also people at Livermore and SLAC. There are a number of collaborators from European facilities like CEA Saclay, DESY, and CERN, and there's also a growing number of collaborators in industry who are using WarpX and contributing a lot of great things back, so we appreciate that. I also want to mention that WarpX is one of the NESAP codes.
So, just some background and motivation. The main application of WarpX is modeling particle accelerators, and I think when most people think of particle accelerators, they think about something like the LHC at CERN. This is a giant, building-sized piece of equipment, I think it's 27 kilometers in circumference or something like that, and it accelerates particles to fantastic energies. It's used for discovery science, and particle accelerators have been very successful in that role, enabling a lot of the Nobel prizes that have been awarded in both physics and chemistry over the years.
But there are other applications of accelerators as well. Medical applications, for example: there are 9,000 medical accelerators in operation worldwide, used for things like radiation treatment for cancer or the production of medical isotopes. There are also about 20,000 industrial accelerators used in various capacities, such as semiconductor manufacturing and sterilization of food, and there are a number of national security applications as well. The annual value of all products that use accelerator technology is estimated to be 500 billion dollars.
So the point we're trying to make is that there's an opportunity for particle accelerators to have an even bigger impact if we can reduce their size and cost, and modeling plays a role here, because it allows us to explore and understand the underlying physics and also aid in tuning specific prototype accelerator designs.
A
So
the
next
generation
of
accelerators
needs
the
next
generation
of
HPC
modeling
tools.
So
a
potential
Avenue
for
improving
on
the
size
and
cost
of
accelerators
relies
on
this
plasma.
Acceleration
idea.
So
there's
a
couple
ways
of
doing
that,
but
you
can
fire
either
a
laser
beam
or
a
particle
beam
through
a
plasma.
It
transfers
energy
and
creates
these
electric
fields
in
the
plasma
that
have
this
weight
field
configuration
and
then
there's
a
beam
of
particles
traveling
in
that
wake
that
can
get
accelerated
to
high
energy
and
as
of
2019.
A
A
So
that's
nowhere
near
the
energy
that
led
like
the
LHC,
but
if
you
want
to
build
something
that
could
accelerate
things
to
like
the
multi
TV
scale,
the
idea
would
be
to
chain
a
bunch
of
these
individual
stages
together
and
you
would
need
to
like
the
laser
case.
Depleted
of
energy
you'd
need
to
like
inject
new
lasers
after
every
stage
and
there's
a
bunch
of
tuning.
That
needs
to
be
done
to
make
sure
the
beam
quality
is
maintained
as
it
passes
from
one
stage
to
another.
So this is kind of WarpX's challenge problem, and so far we've done, I think, the first simulation of this kind, modeling 10 stages of a laser wakefield accelerator. You can see it here; this is in-situ visualization. I believe this is showing iso-contours of the transverse electric field, and it's also showing the particle beam colored by the longitudinal momentum.
This was an in-situ rendering done using Ascent plus VTK-m, and we were able to do a convergence study of this using 3 to 768 GPUs per run, and the convergence properties looked nice. So that's kind of the WarpX challenge problem, but on top of that there are a number of other application areas, and it's a growing list, I guess.
People are using WarpX to study laser-ion acceleration, to look at plasma confinement for fusion devices, and to model microelectronic devices and thermionic converters. There's also an effort to apply WarpX to astrophysical modeling. So, although it was designed with particle accelerators in mind, it's a general PIC code and can be used for other things as well. So, just a sort of overview of the WarpX code then: it's a PIC code.
We have macroparticles that represent collections of electrons or positrons or other charged particle species, and there's also a mesh on which we store the electromagnetic fields, the current density, and the charge density.
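To make the basic PIC cycle concrete, here is a minimal, self-contained 1D sketch of the gather / push / deposit steps just described. This is purely illustrative (plain C++ on a periodic grid with linear weights), not WarpX code, and every value in it is made up.

```cpp
// Minimal sketch of one particle-in-cell (PIC) step on a 1D periodic grid.
// Illustrative only; WarpX itself is 3D, electromagnetic, and GPU-capable.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int nx = 64;                          // number of grid cells (hypothetical)
    const double dx = 1.0, dt = 0.1, q = -1.0, m = 1.0;
    std::vector<double> Ex(nx, 0.0), rho(nx, 0.0);
    std::vector<double> xp = {3.2, 10.7, 40.1}; // particle positions
    std::vector<double> vp = {0.0, 0.5, -0.2};  // particle velocities

    for (std::size_t p = 0; p < xp.size(); ++p) {
        // 1) Field gather: interpolate the grid field to the particle (linear weights).
        int i = static_cast<int>(std::floor(xp[p] / dx));
        double w = xp[p] / dx - i;
        double Ep = (1.0 - w) * Ex[i % nx] + w * Ex[(i + 1) % nx];
        // 2) Particle push: update velocity, then position.
        vp[p] += q / m * Ep * dt;
        xp[p] += vp[p] * dt;
        xp[p] = std::fmod(xp[p] + nx * dx, nx * dx);   // periodic wrap
        // 3) Charge deposition: scatter the particle's charge back to the grid.
        int j = static_cast<int>(std::floor(xp[p] / dx));
        double wj = xp[p] / dx - j;
        rho[j % nx]       += q * (1.0 - wj) / dx;
        rho[(j + 1) % nx] += q * wj / dx;
    }
    // A field solve (e.g. a finite-difference Maxwell update) would follow here.
    std::printf("rho[3] = %f\n", rho[3]);
    return 0;
}
```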
On top of the basic PIC algorithms, WarpX implements a number of advanced features: the ability to operate in a Lorentz-boosted reference frame, high-order spectral solvers, support for embedded geometries, support for mesh refinement, and so on. There are also a number of multi-physics modules that come in via the PICSAR library; these model things like field ionization, Coulomb collisions, and QED processes such as pair creation, for example. We support 1D, 2D, and 3D Cartesian geometry, and we also have support for an RZ quasi-cylindrical mode.
In terms of the parallelization, we use a hierarchical approach. There's an MPI level where we have different boxes; this is a 3D domain decomposition. Those boxes are distributed over the different MPI ranks, and we can also do dynamic load balancing by shuffling those boxes around as we want.
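AMReX has its own box-distribution strategies (for example space-filling-curve and knapsack approaches), so the following is only a toy sketch of the general idea behind load balancing by shuffling boxes: estimate a cost per box and greedily hand the most expensive boxes to the least-loaded ranks. All names and costs here are hypothetical.

```cpp
// Greedy "heaviest box to least-loaded rank" sketch; not AMReX's implementation.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> box_cost = {5.0, 3.0, 7.0, 2.0, 4.0, 6.0}; // per-box work estimates
    const int nranks = 3;
    std::vector<double> rank_load(nranks, 0.0);
    std::vector<int> owner(box_cost.size(), -1);

    // Visit boxes from most to least expensive.
    std::vector<int> order(box_cost.size());
    for (int b = 0; b < static_cast<int>(order.size()); ++b) order[b] = b;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return box_cost[a] > box_cost[b]; });

    for (int b : order) {
        int r = static_cast<int>(
            std::min_element(rank_load.begin(), rank_load.end()) - rank_load.begin());
        owner[b] = r;                  // assign box b to the least-loaded rank
        rank_load[r] += box_cost[b];
    }
    for (int b = 0; b < static_cast<int>(owner.size()); ++b)
        std::printf("box %d -> rank %d\n", b, owner[b]);
    return 0;
}
```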
Then, on a given node, what happens depends on whether we're compiling for GPU or CPU execution. For GPUs we have support for CUDA, HIP, and SYCL backends, and we also have support for an OpenMP backend if we're doing multi-threaded calculations on, for example, many-core architectures. Finally, there's support for a couple of different kinds of scalable parallel I/O formats and also support for in-situ diagnostics.
So, to port WarpX to GPUs and to achieve performance portability, we're using the AMReX library. This was developed as part of the ECP, the Exascale Computing Project. In addition to the performance portability, it also handles things like domain decomposition and MPI communication, so when you do ghost cell exchanges or particle redistribution, that's handled via AMReX. It also provides tools for the mesh refinement aspects of WarpX and tools for doing the dynamic load balancing.
This is the way the data structures work on the GPU versus the CPU. On the GPU, on each box, we essentially launch CUDA or HIP or DPC++ kernels, and the threads are mapped to either the different cells in the box or the different particles in the box and process them concurrently. With OpenMP, we have an additional layer of parallelism that we support.
The bulk of the support for GPUs is done through these ParallelFor routines. These are part of AMReX; it's similar to what is provided in Kokkos or RAJA, in that the work you're expressing is done via this lambda function right here, and the idea is that, depending on how you compiled the code, it will specialize it for a specific platform.
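As a rough sketch of that pattern (assuming AMReX is installed; header names are as I recall them and may differ slightly between versions), a ParallelFor launch over a box looks roughly like this. The lambda body is what gets specialized for CUDA, HIP, SYCL, or the host depending on how AMReX was configured.

```cpp
// Hedged sketch of an AMReX ParallelFor launch over a box of cells,
// assuming a 3D AMReX build; the kernel body itself is a toy example.
#include <AMReX.H>
#include <AMReX_FArrayBox.H>
#include <AMReX_Gpu.H>

int main(int argc, char* argv[]) {
    amrex::Initialize(argc, argv);
    {
        // A 32^3 box of cells and a single-component field living on it.
        amrex::Box bx(amrex::IntVect{0, 0, 0}, amrex::IntVect{31, 31, 31});
        amrex::FArrayBox fab(bx, 1);
        amrex::Array4<amrex::Real> const& a = fab.array();

        // One GPU thread (or one loop iteration on the host) per (i,j,k) cell.
        amrex::ParallelFor(bx,
            [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept
            {
                a(i, j, k) = static_cast<amrex::Real>(i + j + k);
            });
    }
    amrex::Finalize();
    return 0;
}
```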
A
So
what
you're
seeing
here
is
both
the
scaling
of
the
code,
which
looks
quite
nice
up
to
2048
nodes
and
also
the
benefit
that
we
get
from
running
on
the
gpus,
which
I
think
this
is
like
a
factor
of
30
or
something
Improvement
on
this
problem.
I
should
say
so.
This
is
these.
These results are for V100, but if we run the same thing on A100 we get an additional improvement, almost a factor of two comparing V100 to A100, and that was nice because it was a fair bit of work to port WarpX to run on GPUs. But once we did that and had it running well on Summit, it just ran without any code modifications on Perlmutter, and it was almost twice as fast. So that's nice.
So, a bit about the porting-to-GPU process. In order to use these ParallelFor routines, we had to port the kernels in WarpX from Fortran to C++. On the left is one of the finite-difference solvers that's available in WarpX, the routine that updates the electric field in the y direction.
This is what it looked like in Fortran, and this is what it looked like in C++. The original WarpX code was a mix of C++ and Fortran, and the Fortran was mostly for the computationally expensive kernels that crunch the numbers. I guess there are two points I want to make. One is that, other than the loop over the cells, the actual update that does the math here is almost exactly the same between the Fortran and the C++, and this is facilitated by the AMReX Array4 multidimensional array type, which is designed to be used as much like Fortran as possible. We did comparisons of the overhead between this C++ multidimensional array and Fortran, and it's basically nothing for CPU execution.
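For illustration, here is a sketch of what such an Array4-based field update can look like. This is not the actual WarpX kernel, and all names are assumptions; the stencil follows the Ampere-Maxwell update for Ey, dEy/dt = c^2 (dBx/dz - dBz/dx) - Jy/eps0.

```cpp
// Illustrative sketch (not the WarpX source) of a Yee-style Ey update written
// against AMReX's Array4 view and ParallelFor. All names are assumptions.
#include <AMReX_FArrayBox.H>
#include <AMReX_Gpu.H>

void update_Ey (amrex::Box const& bx,
                amrex::Array4<amrex::Real> const& Ey,
                amrex::Array4<amrex::Real const> const& Bx,
                amrex::Array4<amrex::Real const> const& Bz,
                amrex::Array4<amrex::Real const> const& jy,
                amrex::Real dt, amrex::Real dx, amrex::Real dz,
                amrex::Real c2, amrex::Real inv_eps0)
{
    // One GPU thread (or one loop iteration on the host) per (i,j,k) cell.
    amrex::ParallelFor(bx,
        [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept
        {
            // dEy/dt = c^2 * (dBx/dz - dBz/dx) - Jy / eps0
            Ey(i,j,k) += dt * ( c2 * ( (Bx(i,j,k) - Bx(i,j,k-1)) / dz
                                     - (Bz(i,j,k) - Bz(i-1,j,k)) / dx )
                                - inv_eps0 * jy(i,j,k) );
        });
}
```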
So there was some upfront cost in development time for doing the port, but now we have a single code base that works for NVIDIA, AMD, and Intel GPUs and still supports many-core CPUs. And again, once this process was done for V100, getting the code to run on A100 and on the MI200-series GPUs that are on Frontier was relatively pain-free.
I want to talk a bit about this: WarpX is one of the finalists for the Gordon Bell prize this year, and they let us on Frontier in order to do some bigger runs and see how the code scaled there. We also had access to Perlmutter through the NESAP project, and we were also able to do some big runs on both Fugaku and Summit. The simulation was basically one of those plasma acceleration stages, so it's a setup kind of like this.
You have the laser pulse, you have the wakefield, and there's a beam of particles behind it. What we got was quite nice weak scaling, basically up to the full number of nodes available on these machines. On Frontier and Fugaku we're getting about 85 to 90 percent weak scaling efficiency, and on Perlmutter and Summit it's maybe more like 75 percent, but that's at almost the full scale of the machine.
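For reference, the weak-scaling efficiency quoted here is the standard ratio of single-node time to N-node time with the work per node held fixed:

```latex
% Weak-scaling efficiency with the problem size grown in proportion to N nodes:
% t_1 is the runtime on one node, t_N the runtime on N nodes at fixed work per node.
\[
  E_{\mathrm{weak}}(N) \;=\; \frac{t_1}{t_N},
\]
% so roughly 0.85--0.90 on Frontier/Fugaku and about 0.75 on Perlmutter/Summit
% is the fraction of per-node throughput retained at near-full machine scale.
```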
So, in order to compare performance across different machines, we use this figure of merit, which is basically a measure of the number of particles that you can update in a given unit of time. The figure of merit goes up if you either run a bigger problem in the same amount of time or run the same problem faster; it has both of those things rolled into it.
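The talk doesn't spell out the exact definition, but a figure of merit consistent with "particles updated per unit of time" would be:

```latex
% A figure of merit consistent with "particle updates per unit time":
% N_p = number of macroparticles, N_steps = time steps taken, T = wall-clock time.
\[
  \mathrm{FOM} \;=\; \frac{N_p \times N_{\mathrm{steps}}}{T},
\]
% which increases either by running a larger problem in the same time
% or by running the same problem faster.
```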
What this chart is showing is our progress in this measure over time. You can see that in transitioning from the original Warp code to WarpX, which is the AMReX port, there was a nice improvement, and then in transitioning WarpX from Cori to take advantage of the GPUs on Summit there was again a nice improvement. Over the years we've made some optimizations; I think there are also some times where we go backwards in here too.
But overall, the improvement over the pre-ECP baseline is about a factor of 500, comparing what we got on Frontier to what we got with the original Warp code on Cori, and it's also about a factor of 100 just comparing where we were in 2019, with WarpX on Cori, to what we see on Frontier now. So yeah, I like this chart because it shows machines with AMD GPUs, machines with NVIDIA GPUs, and many-core machines as well.
Another thing I wanted to mention: I sort of implied that you could just use this ParallelFor construct and port all your kernels that way, and for the majority of the kernels in WarpX that is all we did. For some particularly performance-critical kernels, though, it's worth it to do some extra tuning, and an example of that is in the core particle-mesh routines in WarpX, the ones that do the current deposition and the field gathering.
What we found was, first of all, that both of these kernels are heavily memory bound, but if you don't do any particle sorting or anything like that, you're bound by the bandwidth between HBM and the processors. If you do occasionally sort the particles into cell order, so that they're ordered in memory the same way as the mesh, the accesses become much more local and those kernels run significantly faster.
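As a rough illustration of the idea (WarpX and AMReX have their own particle binning and sorting utilities; this standalone sketch just reorders a particle array by a linearized cell index):

```cpp
// Sketch of sorting particles into cell order so that particles in the same
// cell are contiguous in memory; illustrative only, not WarpX's implementation.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

struct Particle { double x, y, z; };

int main() {
    const double dx = 1.0;          // cell size (hypothetical)
    const int nx = 8, ny = 8;       // cells per direction in x and y
    std::vector<Particle> particles = {
        {5.2, 1.1, 0.4}, {0.3, 0.9, 7.7}, {5.4, 1.2, 0.6}, {0.1, 0.8, 7.5}};

    // Key each particle by a linearized cell index, then sort on that key.
    auto cell_index = [&](const Particle& p) {
        int i = static_cast<int>(std::floor(p.x / dx));
        int j = static_cast<int>(std::floor(p.y / dx));
        int k = static_cast<int>(std::floor(p.z / dx));
        return (k * ny + j) * nx + i;
    };
    std::sort(particles.begin(), particles.end(),
              [&](const Particle& a, const Particle& b) {
                  return cell_index(a) < cell_index(b);
              });

    for (const auto& p : particles)
        std::printf("cell %d: (%.1f, %.1f, %.1f)\n", cell_index(p), p.x, p.y, p.z);
    return 0;
}
```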
And I believe I'm getting short on time, so I'll skip the next couple of slides. They just say that all our development takes place on GitHub, we follow an open-source development model, and our documentation is online as well. With that, thank you, and I'd be happy to answer any questions.
Thank you very much. I think there are a couple of questions in the chat; you could just look at that one.
Okay: what is the most costly step of a basic WarpX simulation? It's usually the current deposition, depending on how many particles you have. If you have a very low number of particles per cell, you can get into a regime where the communication costs are the highest thing, but assuming you have a good number of particles, it's the current deposition. And then, as for what we use to solve for the EM fields:
There are a couple of different finite-difference solvers; there's the Yee solver and the Cole-Karkkainen solver implemented in WarpX, and there's also a spectral method, the pseudo-spectral analytic time-domain method, that's an option as well. So there are a few different choices you have for solving for the fields.