From YouTube: VASP Workshop at NERSC: Parallelization
Description
Presented by Martijn Marsman, University of Vienna
Published on December 18, 2016
Presented at the 3-day VASP workshop at NERSC, November 9-11, 2016
Does everybody know what an MPI rank is? When you use MPI, you start independent processes that we call MPI ranks, and these essentially run copies of the program that communicate with each other. Inside the program there are explicit instructions to divide the work and the data, and there are explicit points where these processes hook up and exchange information.
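For illustration, starting VASP with eight MPI ranks might look like the line below; the launcher and executable name (vasp_std) are assumptions that depend on the installation:

    mpirun -np 8 vasp_std    # starts 8 independent processes (MPI ranks),
                             # each running a copy of the program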
So those are the MPI ranks: the individual processes that you start. If we look at how the work is parallelized, how it is explicitly distributed over these MPI ranks, then at the highest level of parallelization, which is actually an optional level, we distribute over the k-points that we have been speaking about.
Why is this the highest level? It is the highest level in the sense that, if we have 8 MPI ranks and we want to divide the work over these ranks, then at the highest level we create out of these 8 MPI ranks, for instance, two groups of four ranks. In that sense it is the highest level, because it is the first division of our total number of MPI ranks, but it is an optional one.
At this highest level, if you want to distribute your work over k-points, you use the tag KPAR and set it to something other than one. If you set it, for instance, to two, you create two subgroups of MPI ranks: the first group works on the first k-point, the second group on the second, the first group on the third, and so on and so forth. So the work on these k-points gets distributed.
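As a minimal INCAR sketch of this, assuming a run with 8 MPI ranks as in the example above:

    KPAR = 2    ! split the 8 ranks into 2 groups of 4 ranks each
                ! group 1 works on k-points 1, 3, 5, ...; group 2 on k-points 2, 4, 6, ...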
The data, however, is not distributed. The data structures were not set up to distribute the data over these groups, so the data is replicated: the work is distributed, and the work on a particular k-point gets done by a particular group, but each group holds all the information for all k-points. That is obviously a bit of a problematic strategy. The problem is that redesigning data structures is always much more work than doing something like this.
So this is the way it was implemented; it was added later on. But replicating the data over these groups of course means that your memory demands will rise, especially if you have a lot of k-points and create a lot of these groups. Let's, for instance, imagine ourselves to be on a single node that has a certain amount of memory.
If I now start a number of ranks on that node and use KPAR to divide the work over k-points among these MPI ranks, each and every one of these ranks will allocate the full amount of memory for the wave functions; that memory does not get distributed over k-points. So your memory demands increase linearly with KPAR.
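A rough back-of-the-envelope version of that statement (the numbers are purely illustrative): if the wave functions for all k-points together take W bytes and you run on N ranks, then

    KPAR = 1  ->  about W / N per rank       (fully distributed)
    KPAR = 4  ->  about 4 * W / N per rank   (each group of N/4 ranks replicates all of W)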
So why is this nice to use, if you have the memory? It is nice because it is such a high level of parallelization that there is hardly any communication involved. Many of the operations that the program does depend only on the k-point itself, so each group can work on its own k-points, and only on very few occasions does it need information from the other groups that work on other k-points. That is very nice.
[Audience question]

Once you have the charge density, for DFT at least, there would be really no need to communicate anymore, because the Kohn-Sham equations at a certain k-point couple to the corresponding equations at another k-point only through the density. So if you have the density and keep it fixed, there would be almost no need to communicate at all.
[Audience question]

Actually they do synchronize, but only at very, very limited instances; they do update the information of the other groups, yes, that is true. So what does this mean? It means, for instance, that if one group works on the wave functions at a certain k-point and another group works on another one, then they synchronize the wave functions, and each group can use exactly the same algorithm to compute the density.
Each group then has all the information it needs to do that. I am not saying there is no other way to do this, or no better way to do this, but this was the quickest way to do it, and it was added more or less as an afterthought.
So why is this not the first strategy that we followed? Because if you go to large systems, you will normally have fewer and fewer k-points. K-point parallelization lends itself to very efficient parallelization, but under many circumstances it doesn't bring a lot, because if you have a very large system you may have only one k-point, and then you cannot parallelize over k-points at all.
So I am not saying this is in any way a perfect strategy, but that is the way it is, and that is the highest level: the first division that we make of our MPI ranks. The default level of parallelization is over the orbitals, the Kohn-Sham orbitals, and that happens within these groups. So let's assume, for the sake of simplicity, that we have only one group, KPAR = 1.
That is one way you could distribute your data. Where would this run into limitations? Imagine that you have a very large cell and you put one molecule in it, or maybe only an atom: a free atom in a very large cell. For this atom only a very limited number of wave functions would be computed, maybe four, but each of these functions would have a very large number of plane-wave coefficients, because you have a very large cell.
So here our MPI ranks are assigned to work on particular bands, but we can combine this with a parallelization over plane-wave coefficients, where I say: now, for instance, two MPI ranks together are responsible for my first wave function, the next two MPI ranks are responsible for the second wave function, and so on.
Yes, so that would be our lowest level of parallelization, and you can make any combination of these two levels; that is controlled by the tag NCORE. I don't know why I wrote it that way — oh yes, because there is another tag that controls the same thing, but NCORE is the one I would advise you to use, because I find the meaning of that tag the easiest to explain; the other one does exactly the same, but in a very obscure way.
Okay, so that has a few consequences, and I try to illustrate them here. It's not so important, but I would like to elucidate what I mentioned before. So this would be rank number one and rank number two — I have two of them — and NCORE is one; in that situation my first wave function resides entirely on rank one.
Then we go to a situation with a complete redistribution of the data — as I said, that is done internally, you don't have to care about it — where now MPI rank one holds half of each of these functions and the second rank holds the other half. That makes these kinds of evaluations much easier, because they are done on a plane-wave-by-plane-wave basis, so each rank can now be responsible for part of this matrix.
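A minimal INCAR sketch of the two layouts just described, for the case of two MPI ranks in total (purely illustrative):

    NCORE = 1   ! each orbital lives on one rank: band 1 on rank 1, band 2 on rank 2
    NCORE = 2   ! both ranks share every orbital: rank 1 holds half of the plane-wave
                ! coefficients of each band, rank 2 holds the other half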
To work on those, we do parallel FFTs, so the ranks work together to do the Fourier transforms of the bands, which is obviously necessary because they share the coefficients; and to do such things, those contractions, there is again some communication involved.
[Audience question]

Regarding the data layout here: we store only the components within the cutoff sphere. For the serial FFTs they get put onto the mesh and then there is simply a standard 3D FFT. In parallel it is done differently: the parallel FFT is an in-house implementation where you transform first along one direction, so there we do an actual sphere-to-cube FFT.
Yes, so some considerations: what could you say at the outset about wise choices for these parallelization parameters? Well, KPAR: if you have a lot of k-points, or even a few k-points, and you can afford the memory, then it is a very good thing to use — always use it to its fullest.
If you have large functions, if you have a lot of plane waves per one-electron function, definitely use NCORE to distribute that work over MPI ranks, but do not set it to more than the number of physical cores that you have on one socket, on one package.
What does this mean? The next generation of hardware is going to be slightly different, but many of the Haswell nodes consist of two sockets, each with a number of physical cores — let's say ten on socket one and ten on socket two. Those cores have very fast access to a certain block of memory; they probably also have access to the memory of the other socket, but that takes slightly more time.
So, for instance, for these FFTs you would not want to force the cores that reside on one socket to read and work with the memory of the other socket; that would not be wise from a performance point of view. So you would not set NCORE to anything larger than the number of cores per socket. Good.
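In INCAR terms, for the example node above with ten physical cores per socket (the core count is just the example used in the talk), that advice amounts to something like:

    NCORE = 10   ! no larger than the number of physical cores on one socket,
                 ! so the ranks that share an orbital also share fast local memory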
[Audience question]

NCORE and NPAR control the same thing — well, not exactly the same, but NPAR was defined in a rather idiotic way, and it is almost impossible to describe. NCORE simply tells you: I have NCORE ranks that work on one function. The other tag is more or less the inverse of it: NPAR tells you how many bands I do in parallel. So if I have ten ranks and I say NPAR is 10, I do ten bands in parallel.
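As a small worked example of how the two tags relate, assuming 10 MPI ranks in a single k-point group (the relation NCORE * NPAR = ranks per group is how I read the explanation above):

    NPAR  = 10   ->  10 bands in parallel, 1 rank per band         (NCORE = 1)
    NCORE = 2    ->  2 ranks share each band, 5 bands in parallel  (NPAR = 5)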
[Audience question]

No, it's not very prominent in our manual, because our manual was really lagging behind by quite a bit, which is bad, yes. And you can now use this level of parallelization over plane-wave coefficients in connection with hybrid functionals as well; that didn't use to be the case, but that has been resolved. So this applies to our current release version, which is parallelized purely using MPI.
The version that you have actually been working with here, and that will be available to users at NERSC, is a new version that we are now beta testing. It uses a combination of OpenMP and MPI, and, roughly speaking, the lowest level of parallelization has been replaced by OpenMP. As I said, in that version, if you use more than one thread, you should not set NCORE to anything other than one.
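Conceptually, running that hybrid beta version might look like the following; the launcher, executable name, and thread count are assumptions that depend on the machine, and the point is only that NCORE stays at one once more than one thread is used:

    export OMP_NUM_THREADS=4    # 4 OpenMP threads per MPI rank (the lowest level)
    mpirun -np 8 vasp_std       # 8 MPI ranks x 4 threads = 32 cores in total
    # and in the INCAR:
    # NCORE = 1                 # keep NCORE = 1 when using more than one thread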
[Audience question]

However, where we go through these Green's functions, we do not go via these four-electron quantities. That would scale again like N to the power of four: if we computed our polarizabilities using these four-electron quantities, we would not reach the cubic scaling.
[Audience question]

No, no, this does not affect your memory requirements: whether NCORE equals one or anything other than one, it does not increase your storage demands. In that sense these two levels, which were the standard levels of parallelization before the k-point level was added, have the data fully distributed over the MPI ranks; that is what this tries to express.
A
B
A
G
A
G
A
The
same
the
same
underlying
idea
is,
is
mentioned
here
right,
so
that
you
that
you
wouldn't
sorry
that
you
wouldn't
increase
this
beyond.
The
number
of
course
were
per
socket.
You
wouldn't
force
communication
between
processes
that
that
do
not
share
as
the
dart
that
are
not
as
close
to
to
a
certain
portion
of
memory
as
other
ones.
Right.
[Audience question]

There are areas where we use OpenMP as an additional level of parallelization — hybrid functionals are one example — but that is not the essential point; I didn't mention it before. I said OpenMP was a replacement for a certain MPI layer of parallelization, and in the areas where you replace MPI parallelization with OpenMP parallelization, you do not save any memory, because the data was already distributed.
There are places where we have used OpenMP to introduce additional layers of parallelization, and there, of course, the nice thing is that, first of all, you do not have to redesign your data structures, because OpenMP uses a shared-memory model, and for the same reason it does not increase the amount of memory that is used. But we already had quite a lot of stuff parallelized under MPI.
So that is not the prime argument in our case, because the data was already cleanly distributed, and where it was not, we have not introduced OpenMP. Where did we not distribute the data? At the very highest level, in the k-point parallelization, and there we do not rely on OpenMP for an additional layer of parallelization.
What we wanted to avoid is having MPI communication inside of OpenMP parallel regions. Actually, we wanted to introduce OpenMP to reduce the number of MPI ranks, not to increase them in the sense that I open a parallel region, create ten threads, and they all start to communicate through MPI. That was the idea behind putting the OpenMP parallelization at the very lowest level.
Is there a magic number? No, unfortunately not, because it depends on the system size. Because OpenMP acts mostly at this level, the parallelization over plane-wave coefficients: if you have a lot of one-electron functions that are in themselves very small, then it would not bring a lot to use many threads here.
Neither would it bring a lot to use many MPI ranks there. If you have a lot of data, you can distribute the work over it; if you have only a small amount of data, be it bands or coefficients, the overhead will start to hurt if you distribute it over too many processes. And that, of course, depends on the particular calculation you are doing: how many one-electron functions you have, and how many coefficients you have per function.
[Audience question]

I think the part of the code that is still almost entirely MPI-parallelized is the part where we do linear response to magnetic fields, the NMR part. For the rest, I can't think of any part that has not been parallelized using OpenMP as well. I mean, it is still a beta version, so it will still be worked on, and it might still change.