From YouTube: 9 Case Study 4: MFDn (NERSC Cori KNL Training 6/2017)
Description
From the Cori KNL Training held June 9, 2017. For slides please see http://www.nersc.gov/users/training/events/cori-knl-training-2/
MFDn is a many-body configuration interaction solver. The code itself is Fortran 90 with a very small amount of C kernels that may or may not be used, but otherwise it's totally self-contained with its own custom eigensolvers; the only external dependencies are BLAS and LAPACK. It's an application that's in use at a number of DOE centers.
So this part of the code is really embarrassingly parallel: every process just needs to figure out which of the nonzero matrix elements it owns and then actually calculate them. This part of the code is mostly integer operations, branching, and lookups, reading out of big lookup tables to compute the matrix elements.
So a typical loop structure in this code is to loop over all of your column states and all of your row states and then decide whether that pair is a nonzero element with a fast method. For example, you could XOR the two bit representations, say the first 16 bits, and count how many set bits are left as a quick check. If you pass the fast check, then you do a more detailed comparison, applying the full quantum mechanical selection rules, which can be quite complicated, and depending on what phase of initialization you're in, you might then optionally compute that matrix element.
Unfortunately, the popcount instruction on KNL only works on integer registers, which has been quite a problem for us. So in some cases we've done better by manually coding our own popcount with intrinsics, but it's really tough to do better than a hardware instruction, even with all the extra copies that you have to do. But manual tiling, and then using OpenMP 4.0 directives to force the compiler to actually generate vector instructions where it should, gives us something like a 20% speed-up.
And just a little bit about the actual format of the matrix: it's in a custom format called compressed sparse blocks coordinate format, and that's due to a couple of reasons. In this code you have to do a matrix-vector operation, but you also have to do the transpose. So if you use compressed sparse row or compressed sparse column, you'll be very fast one way and slow the other way. But if you use compressed sparse blocks that are chosen nicely for your cache size, you can iterate through either dimension.
So one of the main things we did to speed things up on KNL was to get rid of the SpMV and switch to an SpMM, where instead of a single vector you're applying the matrix to multiple right-hand sides at once. This table just shows that increasing the number of right-hand sides increases your arithmetic intensity by almost a factor of four, so you get much better data reuse and we're getting much higher flops.
So we used the FASTMEM directives in Fortran to explicitly place just those two arrays: we stream the matrix A out of DDR and then operate on X and Y kept in MCDRAM. And actually, in this case it's sort of a funny coincidence, but by choosing the number of right-hand sides we can roughly match the ratio of data movement between DDR and MCDRAM to the ratio of bandwidth that DDR and MCDRAM can provide.
So we can get a pretty high fraction of sustained bandwidth from both DDR and MCDRAM at the same time. And this is just a little model showing how well we can do: the blue line is the maximum performance we would expect if we could fully utilize both memories at once, depending on the number of right-hand sides we're using, and you can see that with eight right-hand sides we're getting pretty close to the best we could hope to do, which is good, because anything past eight...
...block preconditioned method. And then just some bigger, broader takeaways: in this specific case we did get a speed-up over using just cache mode for a very memory-intensive code, but that's not necessarily true in every case, and it also requires a lot more effort on our part to manage where the data is actually allocated. So in general, cache mode is within ninety percent of the flat-mode code. And then another thing we found as we've scaled this up is that you really need at least four MPI processes.