Description
Victor Yu & Marco Govoni
GPU Acceleration of the WEST Code for Large-Scale Full-Frequency GW Calculations
Hello, everybody. First of all, thank you very much for having me here. I will be talking about the GPU porting of the WEST code, which is a code for large-scale materials simulations based on many-body perturbation theory.
To put our code development into context: in our group we use first-principles simulations to study electronic structure, for both ground-state and excited-state properties, and we are particularly interested not only in small molecules but also in very large and heterogeneous systems. This is motivated by our target applications, for example nanoparticles for energy harvesting, solid/liquid interfaces for water splitting, spin defects in semiconductors for quantum technologies, and so on.
Shown on this page are just some examples of the nanostructures and materials that we would like to simulate at the many-body perturbation theory level. To do that, we have developed a code package called WEST, which stands for Without Empty STates.
WEST is a parallel implementation of many-body perturbation theory, including GW and BSE, but what distinguishes WEST from a conventional GW code is that it uses a formalism that does not require any summation over empty orbitals. This summation is quite expensive, and it is avoided in WEST. WEST also does not require the storage or inversion of very large matrices, which are widely used in conventional GW codes.
The CPU version of WEST scales very well on CPU-based supercomputers; for example, the largest GW calculation performed with WEST consisted of over 2,000 electrons.
The functionalities of the code are summarized in this plot. It is capable of computing GW and electron-phonon self-energies, and in addition it can simulate excitation processes using density-matrix perturbation theory, which includes BSE and time-dependent density functional theory. It also has a quantum defect embedding theory, which targets strongly correlated defect states in semiconductors.
Finally, the code is parallel, using MPI plus OpenMP, and the CPU version scales to over 500,000 CPU cores. Last year we ported the WEST code to NVIDIA GPUs, and it has been tested and benchmarked on a number of GPU-enabled supercomputers, including Perlmutter at NERSC and Summit at Oak Ridge.
Our porting strategies are the following. Wherever applicable, we try to use vendor-optimized GPU libraries for the linear algebra operations, such as BLAS and FFT. For loops that cannot be handled by existing libraries, we use a directive-based approach, hoping that it is more portable than lower-level languages. We started from CUDA Fortran, which is very easy to write but not portable; then, in the latest release of the code, we transitioned from CUDA Fortran to OpenACC; and as a next step we are switching to OpenMP target offload, which should work on NVIDIA, AMD, and Intel GPUs.
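To illustrate the directive-based style, here is a minimal sketch of the same loop offloaded once with OpenACC and once with OpenMP target offload. This is not WEST source code (WEST is written in Fortran); the C function and its arguments are only placeholders for the kind of loop that cannot be delegated to an existing library.

```c
#include <stddef.h>

/* A simple update loop, y = y + a*x, offloaded with OpenACC directives. */
void axpy_openacc(size_t n, double a, const double *x, double *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* The same loop written with OpenMP target offload, which is expected
 * to work on NVIDIA, AMD, and Intel GPUs. */
void axpy_omp_target(size_t n, double a, const double *x, double *y)
{
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

Moving between the two mostly means rewriting the directives while the loop body stays the same, which is what makes this approach more portable than rewriting kernels in a lower-level language.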
In terms of achievements, we were able to get a big speedup by using GPUs over the CPU-only version, which I will show you in a minute, and the code scales very well on GPU supercomputers, including Summit, Perlmutter, and ThetaGPU. One of the largest GW calculations we have done with the GPU version consists of over 10,000 electrons.
The code here has multiple loops. The outermost loop is sequential and cannot be done in parallel, but the inner loops can be parallelized. In the CPU version of WEST we had two levels of parallelization. At the first level we distribute the perturbations, which are more or less independent from each other.
We distribute them across images, which are nothing but subgroups of the MPI ranks, and then in the inner loop we distribute the plane-wave components across the MPI ranks within an image. As illustrated here, each image performs parallel Fourier transforms and other linear algebra operations in parallel. This works very well on CPUs, so naively we thought that, for each MPI rank, we could simply offload its work to a GPU, which would lead to a picture like this.
Now we are distributing the plane-wave coefficients across multiple GPUs, and therefore we are performing FFTs across many GPUs. This does not work well, because a parallel FFT requires all-to-all communications, which means that the GPUs have to talk to each other quite frequently, and, as we know, computation on a GPU is much faster than communication between GPUs.
We therefore want to avoid data communication between GPUs. To make that happen, we exposed two more levels of parallelism in our code: the first one is the loop over spin channels and the second one is the loop over wave functions. Why does this help? Because it further partitions the MPI ranks within an image into even smaller subgroups of ranks, so that each working group becomes smaller.
Here is just an example: instead of eight GPUs working together on a parallel FFT, now only two GPUs collaborate on parallel FFTs and linear algebra. By doing this we reduce the GPU-to-GPU communication, which is quite costly, and we can also better load-balance the workload on each GPU. Ideally, we do not want to split any FFT operation over two GPUs; if memory is not a limitation, we want to do each FFT on a single GPU.
That is how we get the best performance. But when the problem gets bigger, the FFTs become limited by device memory; then we have to split the FFTs across GPUs, but we use the smallest number of GPUs for each FFT and distribute the workload using the other levels of parallelism in the algorithm.
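A minimal sketch of how such a subgroup split can be expressed with MPI (the communicator and variable names here are hypothetical, not taken from WEST):

```c
#include <mpi.h>

/* Split the MPI communicator of one image into smaller subgroups, so that
 * each parallel FFT only involves as few ranks (GPUs) as the device memory
 * requires; different subgroups then work on different spin channels or
 * wave functions independently. */
MPI_Comm make_fft_subgroups(MPI_Comm image_comm, int ranks_per_fft)
{
    int rank;
    MPI_Comm_rank(image_comm, &rank);

    int color = rank / ranks_per_fft;   /* which subgroup this rank joins */
    int key   = rank % ranks_per_fft;   /* rank order inside the subgroup */

    MPI_Comm fft_comm;
    MPI_Comm_split(image_comm, color, key, &fft_comm);
    return fft_comm;
}
```

In the example above, ranks_per_fft = 2 corresponds to two GPUs collaborating on each FFT instead of eight, and ranks_per_fft = 1 is the ideal case where every FFT stays on a single GPU.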
On this page I am showing a comparison of our baseline GPU code, which is the black bar on this plot, to the version with all the levels of parallelization enabled, which is the red bar. This is a GW calculation of a silicon/water interface with roughly 1,500 electrons. We can see that we get about a 50% performance improvement by using more levels of parallelization and restricting each parallel FFT to one GPU.
Another approach to speeding up the code is to use non-blocking MPI functions and asynchronous GPU kernels to overlap GPU communication and computation. The example shown here is a place in the code where we compute the matrix multiplication of two distributed matrices; the colors here just indicate that the matrices are distributed across different MPI ranks. Our first version is very straightforward.
What we do is copy the local data from CPU to GPU, multiply the local block on the GPU, and then use MPI communication to obtain the next block to be multiplied. The timeline of this approach is shown here: red is the CPU-to-GPU copy, blue is the GPU computation, and orange is the time spent in MPI.
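A minimal sketch of this straightforward version, under the assumption that the distributed blocks are passed around in a ring (the helper gemm_on_gpu and all names are hypothetical, not WEST source):

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Assumed wrapper around a GPU GEMM (e.g. a cuBLAS call); hypothetical. */
void gemm_on_gpu(const double *d_a, const double *d_b, double *d_c, int n);

void ring_gemm_blocking(double *block, double *d_local, double *d_recv,
                        double *d_result, int n, int nblocks, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int prev = (rank - 1 + size) % size;
    int next = (rank + 1) % size;

    for (int step = 0; step < nblocks; ++step) {
        /* 1) copy the block we currently hold from CPU to GPU */
        cudaMemcpy(d_recv, block, (size_t)n * n * sizeof(double),
                   cudaMemcpyHostToDevice);

        /* 2) multiply the local block by it on the GPU (default stream,
         *    so the copy and the GEMM execute in order) */
        gemm_on_gpu(d_local, d_recv, d_result, n);

        /* 3) blocking MPI call to obtain the next block; the GPU sits
         *    idle until this exchange completes */
        MPI_Sendrecv_replace(block, n * n, MPI_DOUBLE, next, 0, prev, 0,
                             comm, MPI_STATUS_IGNORE);
    }
}
```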
We can then optimize this by using non-blocking MPI communication: while the GPU is doing the computation in the background, the CPU is carrying out MPI communication to prepare the next block of the matrix. By doing this we can hide the cost of the GPU computation behind the MPI communication. In fact, we found that the MPI communication part, which in this case is more expensive than the computation itself, can be further sped up by communicating the data in single precision.
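A minimal sketch of the overlapped version, with the same hypothetical names as above: the host posts non-blocking sends and receives for the next block and only waits for them after launching the GPU work for the current one.

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Assumed wrapper around an asynchronous GPU GEMM launch; hypothetical. */
void gemm_on_gpu(const double *d_a, const double *d_b, double *d_c, int n);

void ring_gemm_overlap(double *cur, double *nxt, double *d_local,
                       double *d_recv, double *d_result, int n,
                       int nblocks, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int prev = (rank - 1 + size) % size;
    int next = (rank + 1) % size;

    for (int step = 0; step < nblocks; ++step) {
        /* copy the block we already have to the GPU */
        cudaMemcpy(d_recv, cur, (size_t)n * n * sizeof(double),
                   cudaMemcpyHostToDevice);

        /* start fetching the next block in the background ... */
        MPI_Request req[2];
        MPI_Isend(cur, n * n, MPI_DOUBLE, next, 0, comm, &req[0]);
        MPI_Irecv(nxt, n * n, MPI_DOUBLE, prev, 0, comm, &req[1]);

        /* ... while the GPU multiplies the current block; the kernel
         * launch returns immediately, so MPI can progress concurrently */
        gemm_on_gpu(d_local, d_recv, d_result, n);

        /* wait for the next block, then swap the two host buffers */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        double *tmp = cur; cur = nxt; nxt = tmp;
    }
}
```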
What we do is, before the MPI communication, convert the data from double precision to single precision, and then do the MPI communication in single precision. It turns out that doing this does not change the physical observables we are computing, but it leads to roughly a factor-of-two speedup in the MPI communication.
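A minimal sketch of the precision demotion around the exchange (names hypothetical; the point is only that the bytes sent over MPI are halved while the rest of the calculation stays in double precision):

```c
#include <mpi.h>
#include <stdlib.h>

/* Exchange a buffer with a partner rank in single precision, then restore
 * it to double precision for the remaining double-precision computation. */
void exchange_in_single_precision(double *buf, int count, int dest, int src,
                                  MPI_Comm comm)
{
    float *sp = (float *)malloc((size_t)count * sizeof(float));

    for (int i = 0; i < count; ++i)      /* demote before communicating */
        sp[i] = (float)buf[i];

    MPI_Sendrecv_replace(sp, count, MPI_FLOAT, dest, 0, src, 0,
                         comm, MPI_STATUS_IGNORE);

    for (int i = 0; i < count; ++i)      /* promote back afterwards */
        buf[i] = (double)sp[i];

    free(sp);
}
```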
Overall we got a 1.9x speedup compared to the baseline, and again the GW energies match the double-precision numbers very well. Finally, we also did some relatively low-level optimizations to fine-tune the GPU memory access and the I/O operations, and we ended up with a fourth version of the GPU code, which is 2.2x faster than our baseline.
Next, let's look at how the code scales on supercomputers. This page shows a benchmark of a cadmium selenide nanoparticle, which has about 900 electrons. The plot compares the time to compute the GW quasiparticle energies of this nanoparticle on different supercomputers. On the right-hand side are the older NERSC CPU machines, including the retired Edison and Cori Haswell. The code scales quite nicely there, but on the left side you can see that using GPUs it is much faster.
If we compare the same number of nodes of Summit, which is the orange symbols, and Cori Haswell, which is the purple triangles, we get about a 30x speedup. And, as mentioned in the previous talks, switching from Summit to Perlmutter, or in other words from V100 GPUs to A100 GPUs, with exactly the same code we got another factor of two, which we think is because the A100 has more device memory.
The scaling on Summit and Perlmutter is also quite nice; both are close to the ideal strong scaling indicated by the dashed line.
Now let's look at a bigger benchmark on Summit. This one uses two silicon supercell models, one with a thousand atoms and the other with 1,700 atoms.
Again, the strong scaling on Summit looks quite nice. The bigger system with 1,700 atoms scales better because it has more computation to do, and it is again quite close to ideal scaling. With 94% of the entire Summit machine, which means about 25,000 NVIDIA V100 GPUs, we were able to solve 80 quasiparticle energy levels of this big silicon supercell in about half an hour.
Lastly, I'd like to show you some production calculations done with the GPU version of WEST. Shown on this page are three very large GW calculations: a large nanoparticle with more than 2,000 electrons, a giant silicon/silicon nitride interface with over 10,000 electrons, and a spin defect in 4H silicon carbide with 6,000 electrons.
Those calculations were done using production settings, not some loose, inaccurate settings, and we are not only computing a few energy levels: we computed thousands of quasiparticle energies to plot the local density of states, as shown here. This shows that the GPU version of WEST can be used to solve very large problems that can hardly be done on CPUs alone.
So, to wrap up: we have ported the WEST code, specifically the GW part of WEST, to NVIDIA GPUs, and we achieved very good performance and scalability on GPU-accelerated supercomputers, including Perlmutter.
We carried out some large GW calculations, of which the largest one has 10,000 electrons, running on 25,000 NVIDIA GPUs. We recently ported the quantum embedding part of WEST to GPUs; we are still testing its performance on Perlmutter, but the initial results look quite promising. We also plan to port the other functionalities, like the BSE code and the electron-phonon code, to GPUs.
And finally, we want to make sure the code works not only on NVIDIA GPUs but also on AMD and Intel ones.
To achieve this, we are moving from OpenACC to OpenMP target offload, because we want to make the code work on the exascale machines like Aurora at Argonne and Frontier at Oak Ridge.
Here are the people and organizations who helped us on this project. In particular, we are part of the NERSC NESAP program, and thanks to this project we got help from NERSC experts and also got early access to Cori GPU and Perlmutter, which were invaluable resources for us.