From YouTube: 8 Case Study 3: AMReX (NERSC Cori KNL Training 6/2017)
Description
From the Cori KNL Training held June 9, 2017. For slides please see http://www.nersc.gov/users/training/events/cori-knl-training-2/
Hi, my name is Brian; I was here earlier this morning. This is less of a case study and more of a work in progress. It's a relatively recent undertaking, and in that way it's a little more exciting, but I also don't have nice performance plots like ZZ had. I'm sort of thinking through how to solve this interesting algorithmic problem, and that's what I've documented here a little bit.
AMReX is derived from BoxLib, which was the previous framework for doing things like this, and it also has many elements of the Chombo framework, which is also developed here. AMReX is a framework, so science applications sit on top of it, and those applications do many things. Most of them are fluid dynamics of some sort: reacting flows, so combustion (chemical combustion for terrestrial applications and nuclear combustion for astrophysics), low Mach number flows, compressible and incompressible things. There's even an N-body capability; there is a particle component of AMReX, so if you're doing particle-grid or particle-fluid interactions, you can do that. They're also working on some really interesting things with embedded boundaries and complex geometries. AMReX has many players in it, and it is part of the DOE Exascale Computing Project.
One of the applications I've been working on for the last several months, a few years in fact, is called Nyx, which is a cosmology code. Nyx is actually a hybrid of two sorts of algorithms: there's a compressible-flows algorithm and there's also an N-body component. So Nyx does cosmological simulations that incorporate both baryonic matter, which is to say a fluid that collides with itself, and dark matter. It's used for simulating early-universe things: the formation of filaments in the universe, dark matter halos, AGN feedback. There are a lot of applications of this code.
It doesn't show up that well on the screen, I think, but this is a simulation of filaments forming that I actually did on Cori not that long ago; it's part of a movie and it looks really nice.
There are a number of different algorithms that are all operating at the same time in Nyx; you can roughly narrow them down to four. There are, as I mentioned, the compressible-flows component and the N-body component, and those two, relatively speaking, are fairly cheap to compute. The compressible-flows algorithm requires some information from the nearest neighbors of each cell, so it's sort of like a stencil.
The N-body part is much more chaotic: you're basically pointer chasing, because the particles can go anywhere and you have to figure out where they are on the grid. So that's much more random access; it's a very memory-latency-bound algorithm. But the reason it's cheap is that the particle density is actually very low. Whereas in particle-in-cell codes for fusion, for example, the number of particles per grid cell can be extremely high, in cosmology it's sort of the opposite: it can be of order ten, or even one. So that's the reason it's cheap. It's not that the algorithm is cheap; it's that the particle density is quite low.
There's also a multigrid solver in Nyx: we solve a Poisson equation for gravity, so that's our elliptic solver. That one is relatively cheap except at scale, which is fairly typical of multigrid solvers, because there's a lot of communication required as you coarsen the grids and then do restriction and prolongation back up to the fine grids; that communication becomes very expensive at scale. So except when running at the largest scales, for example on half or all of Cori, the multigrid component is relatively cheap.
The really expensive part is actually radiation, which would not come as a surprise to anybody who has actually done radiation transport before. But, interestingly, there is actually no transport done in Nyx: it's all point-wise calculations, basically microphysics, so there are no stencils involved or anything like that. It's still extremely expensive at all scales, though; it's basically independent of scale because it's a point-wise calculation. So I'll show you why that is, and then how I'm trying to fix it, for KNL especially.
In every cell of a Nyx calculation, we have to solve a heating-cooling equation, which takes the ionization state of some number of atoms and computes how much radiation is deposited into the gas and, conversely, how much the gas cools; basically, how the atoms interact with the radiation. You can write it as a first-order differential equation that looks like the one on the slide, which I have made quite simple. But this equation is a little bit strange, because every time you evaluate the right-hand side, this f(y, t), you actually have to solve an equilibrium constraint equation which itself depends on f. So every time you evaluate the right-hand side, you have to do some kind of root find.
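As a minimal sketch of that structure (using only the generic symbols f, y, t, and lambda from the talk; the actual Nyx heating-cooling terms are more involved), the problem looks roughly like:

```latex
\frac{dy}{dt} = f(y, t),
\qquad \text{where each evaluation of } f \text{ first requires a root find on the
equilibrium constraint } g(\lambda;\, y, t) = 0 .
```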
This g is a highly nonlinear function. It's worth saying that even though there's only one equation (this is not a system of differential equations, it's just one), it can be highly sensitive to the dependent variable, so you can see large spikes in the solution as you evolve it in time. You can sort of characterize it as stiff: within a step the solution will spike all over the place, and RK4 doesn't capture that, or may not capture it, unless you take extremely small time steps. For this reason, because we generally have to evaluate the right-hand side many times, and because every right-hand-side evaluation requires doing a root find, solving this ODE is in fact the most expensive component in all of Nyx, and it has to be done in every cell. So, to refresh people's memory about the obstacles to vectorizing this:
They include, for example, the lack of vectorization between iterations. Within a cell, you can't really vectorize much of anything, because there's a data dependence: as you solve for this parameter, as you try to converge this lambda toward 0, the solution for lambda at iteration n+1 depends on the solution at iteration n. That data dependence makes vectorizing along the iteration axis not possible, and so effectively, or rather in reality, all of these ODE integrations, in every cell, are done in scalar.
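As an illustration of that dependence (a hedged sketch, not the actual Nyx source; g and dg are hypothetical stand-ins for the equilibrium function and its derivative), the scalar Newton iteration carries its value from one iteration to the next, which is exactly what blocks vectorization along the iteration axis:

```cpp
#include <cmath>

// Sketch only: scalar Newton iteration for the equilibrium constraint
// g(lambda) = 0 in a single cell.  `g` and `dg` are hypothetical stand-ins.
double solve_equilibrium(double lambda,
                         double (*g)(double), double (*dg)(double),
                         double tol, int max_iter)
{
    for (int n = 0; n < max_iter; ++n) {
        // The update for iteration n+1 needs the value from iteration n,
        // so successive iterations cannot be spread across vector lanes.
        double next = lambda - g(lambda) / dg(lambda);
        if (std::abs(next - lambda) < tol) return next;
        lambda = next;   // loop-carried dependence
    }
    return lambda;
}
```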
(I don't have my watch with me... 23? Okay.) So we can't really vectorize within a cell, and the solution we're implementing right now is actually to vectorize across cells. In effect, we're trying to integrate eight of these ODEs simultaneously, which is an interesting challenge that I've never encountered before; it's a lot of fun. SIMD lanes, unlike threads, operate in lockstep. If you have a threaded region, your threads can diverge in their workload and do different things; that's not a problem. SIMD lanes can't do that: they have to do the same thing in every lane. So if your algorithm has any kind of variability due to a convergence criterion or something like that, that makes doing SIMD a little bit challenging, and variability is certainly a component of what we're trying to do here. There are actually two chief sources of variability.
A
Among,
for
example,
8o
des
from
8
neighboring
cells,
the
first
one
is
that
the
number
of
iterations
to
do
the
number
of
newton
iterations
that
you
have
to
do
to
satisfy
that
equilibrium
constraint,
is
unique
to
the
initial
conditions
they
into
the
right-hand
side
of
every
cell.
So
the
number
of
roots
of
the
number
of
iterations
to
do
your
root
find
might
be
different
in
neighboring
cells.
A
The
other
source
of
variability
is
that
we
use
a
stiff
integrator
to
to
integrate
this
differential
equation
and
it's
actually
an
adaptive
method,
and
so
the
number
of
steps
that
it
will
take
to
to
integrate
from
time
t0
to
t1
may
be
different
for
two
neighboring
cells.
So
again,
there's
it's
challenging
to
handle
that
kind
of
variability
in
Cindy,
but
we
have
some
methods
for
doing
that.
What we do to vectorize the Newton solve is a few extra, sort of pointless, flops in order to do everything in SIMD; that's a fairly common pattern in writing vectorizable code. In particular, we just keep iterating until the slowest-converging cell has converged. So you might have one cell that takes only 2 Newton iterations to get to zero while the cell right next to it takes 9, and the cell that got there first will just keep bisecting, or not bisecting exactly, but it will keep doing that root find and getting even closer to zero. The hope is that it won't actually diverge, which theoretically is possible, but in our tests these are fairly well-behaved root finds, so they don't do that. So that's how we vectorize the right-hand side: it actually does 8 simultaneous Newton solves.
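Here is a minimal sketch of that pattern, assuming a SIMD width of eight doubles and written with a plain fixed-size array that a compiler can auto-vectorize; the real implementation may use intrinsics or a different data layout, and g and dg are again hypothetical stand-ins:

```cpp
#include <cmath>

constexpr int VW = 8;   // assumed SIMD width (eight doubles per vector)

// Sketch only: eight simultaneous Newton solves for g(lambda) = 0, one per cell.
// Every lane keeps iterating until the slowest-converging lane is done; lanes
// that converged early just perform harmless extra updates (wasted flops).
void newton_across_cells(double lambda[VW],
                         double (*g)(double), double (*dg)(double),
                         double tol, int max_iter)
{
    for (int n = 0; n < max_iter; ++n) {
        bool all_converged = true;
        for (int i = 0; i < VW; ++i) {             // no dependence across lanes
            double next = lambda[i] - g(lambda[i]) / dg(lambda[i]);
            if (std::abs(next - lambda[i]) > tol) all_converged = false;
            lambda[i] = next;                      // converged lanes keep refining
        }
        if (all_converged) break;                  // stop when the slowest cell converges
    }
}
```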
As for the other source of variability, we're trying to vectorize the stepping routine itself, and it's in a similar spirit. A well-behaved ODE might take only 3 steps to get from time t0 to t1 in that integration, and another one might take 10. What we're trying right now is to just pick the slowest one: whatever the smallest time step is that any of them needs to take in order to satisfy the convergence criteria.
We just do that, so even for a well-behaved ODE we make it take really small time steps, just like all the others, so that everybody can operate in lockstep. That's not our final answer; it's one thing we're trying, and I think there are better things coming in the pipeline, but it's a first-order approach to solving this problem. Again, we're doing a lot of extra floating-point operations that in theory we don't need to do, but the net benefit is still worth it: you've done a few extra operations you didn't need, but you're also doing eight at a time. So the overall speedup from doing this will certainly not be a perfect factor of eight, but it will be larger than one, as opposed to doing it all in scalar, which is how it's done right now.
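A hedged sketch of that lockstep stepping idea follows: each lane proposes the step its own error control would like, and all lanes then advance by the smallest proposal so they stay in lockstep. The helpers f and propose_dt, and the forward-Euler update standing in for the real stiff integrator, are illustrative assumptions, not Nyx code:

```cpp
#include <algorithm>

constexpr int VW = 8;   // assumed SIMD width

// Sketch only: advance eight independent ODEs y' = f(y, t) from t0 to t1 in lockstep.
// `propose_dt` stands in for whatever adaptive step-size control the integrator uses.
void integrate_lockstep(double y[VW], double t0, double t1,
                        double (*f)(double y, double t),
                        double (*propose_dt)(double y, double t))
{
    double t = t0;
    while (t < t1) {
        // Each lane asks for the step its own convergence criterion would allow...
        double dt = t1 - t;
        for (int i = 0; i < VW; ++i)
            dt = std::min(dt, propose_dt(y[i], t));
        // ...but every lane takes the smallest one, so all lanes stay in lockstep.
        // Well-behaved cells therefore do extra, unneeded steps.
        for (int i = 0; i < VW; ++i)               // vectorizable across lanes
            y[i] += dt * f(y[i], t);               // forward-Euler update as a stand-in
        t += dt;
    }
}
```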
That slide is unreadable, so I will skip it. I think this one is also pretty unreadable; it's just the last bit of the Newton method that's in SIMD now. You can see here's the right-hand side of a Newton iteration, where you take the function divided by its derivative, and then all we do... there's actually a bug in this: this last loop doesn't need to be there.
Okay, so one final point: if you're using an explicit integrator that's still adaptive, like RKF45, it's actually pretty straightforward to vectorize. RKF45 is an adaptive integration scheme; it's probably the simplest adaptive integration method. That's because explicit methods just evaluate the right-hand side, and that's how they decide when they've converged. If you do something implicit like VODE, which uses BDF methods, it's much more complicated, because implicit methods are generally solving some kind of linear system at every integration step: they lead to a Jacobian matrix that you then have to solve with, and solving eight systems of linear equations in SIMD is not so straightforward.
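For context, the step-size control in an embedded explicit pair like RKF45 only needs the local error estimate that comes from extra right-hand-side evaluations, which is part of why it maps onto SIMD lanes so easily. One common form of that update (with a safety factor S, often around 0.9) is roughly:

```latex
h_{\text{new}} \;\approx\; S \, h \left( \frac{\mathrm{tol}}{\mathrm{err}} \right)^{1/5}
```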
[Moderator] Another question? Sure, go ahead.

[Audience member] It's kind of analogous to the parallel-in-time stuff people have been working on. Have you considered parallelizing the iteration itself across a bunch of points? Since your guesses may not be quite right, you could evaluate several different tries for each iteration and pick the best one as the starting point for the next iteration.
[Brian] Yeah, that's a good point. I haven't tried that yet, but I see no reason why it wouldn't work, so maybe that will be step two. Thanks; thank you for the comment.