From YouTube: GW Calculations at Scale
Description
Charlene Yang of NERSC presents a talk on GW Calculations at Scale. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/ Session Chair: Michael Rowan.
I'll be talking about some of the work we did for BerkeleyGW, which is one of the NESAP applications. I have been pretty lucky to be part of this team, and all the work we have done so far has won us a finalist nomination for the Gordon Bell Prize this year. So hopefully this is going to be interesting work to you.
So what is GW? G stands for the Green's function and W for the screened Coulomb interaction, and GW calculations usually sit on top of some of the other chemistry or materials science codes like Quantum ESPRESSO or ABINIT.
These applications calculate some of the ground-state properties, and then the results get fed into GW codes like BerkeleyGW and other GW codes, which refine on top of those results and give a more accurate estimate of, say, the self-energy or some other properties. These GW calculations help us understand questions like the ones listed here: what happens when you add or remove an electron from a system? How do electrons behave when you apply a voltage? How does the system respond to light or X-rays?
So these are very important questions in materials science, and very important for energy-related device design.

A bit about BerkeleyGW: there are four different modules in BerkeleyGW, namely Epsilon, Sigma, Kernel, and Absorption. If you compile BerkeleyGW you'll get four different executables, and these executables are run in sequence.
The output of Epsilon can be fed into Sigma to calculate the self-energy, and then the output of Sigma can be fed into Kernel to calculate something else, and it just goes on like this. But today I'm going to focus on just the GW-based calculations, which are Epsilon and Sigma; the other two are more Bethe-Salpeter-equation-based calculations.
Some of the computational motifs for BerkeleyGW are matrix multiplications, fast Fourier transforms, large reductions, eigenvalue problems, and matrix inversions, but in Epsilon and Sigma we're mostly dealing with the first three. And when I say large scale, some of the matrices have hundreds of thousands of rows and millions of columns, so we're really dealing with large matrices and large calculations.
Epsilon has a quartic scaling in terms of computation and cubic for memory; it's a bit lower for Sigma. I bring this up because it helps us understand what the bottlenecks could be for these kernels, or for these modules.
If I speak from a Roofline performance model point of view, these scaling properties help me understand whether a kernel is compute bound, memory-bandwidth bound, or communication bound, or even how much physical memory we need, because we could be limited by how much physical memory we have on the GPU or even on the host. Understanding this is very important; at least it has been very helpful for this work.
So the goal of this NESAP project is to have a GPU port of the code, and an efficient one. We started with a pretty efficient CPU implementation, which is parallelized with MPI and OpenMP and has scaled pretty well, up to 12 petaflops on Cori.
Given our understanding of this code, given all the GEMM-like calculations and the dense linear algebra we have, we believed that GPUs could really help us speed up even more. The approach we took used two programming models: one is CUDA C++ and the other is OpenACC.
The reason behind this choice is that we wanted to prototype some ideas pretty quickly using OpenACC, because it's a directive-based language, so it's very easy to code something up if you have an idea. And because the code is written in Fortran, if we wrote everything in CUDA there could be a lot of interfacing, even though in the CUDA branch we did end up doing all of this.
As for the choice of the CUDA C++ version for some kernels, we were hoping to fine-tune some of the kernels in places where OpenACC is not able to. That was the rationale behind that; and because we have so many GEMM, LAPACK, and FFT operations, we can also lean on the corresponding GPU libraries for those.
Some of the techniques we used to optimize this code and make it efficient include the non-blocking cyclic communication scheme, the use of CUDA streams, and a batched operation, or batching mechanism.
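As a minimal sketch of the CUDA-streams idea (hypothetical code, not the BerkeleyGW implementation), the point is to split work into chunks so that data transfers for one chunk overlap with compute for another:

```cpp
#include <cuda_runtime.h>

__global__ void scale(double* x, long n, double a)
{
    long i = blockIdx.x * (long)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// h must be pinned host memory (cudaMallocHost) for the async copies
// to actually overlap with the kernels.
void pipelined(double* h, double* d, long n, int nchunks)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    long chunk = n / nchunks;            // assume nchunks divides n
    for (int c = 0; c < nchunks; ++c) {
        cudaStream_t st = s[c % 2];      // alternate between two streams
        long off = c * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(double),
                        cudaMemcpyHostToDevice, st);
        int grid = (int)((chunk + 255) / 256);
        scale<<<grid, 256, 0, st>>>(d + off, chunk, 2.0);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(double),
                        cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```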
I will not go into the details, but I will touch on the communication scheme a bit, and also some optimizations we did for Sigma GPP. But before going into that, with this table here I would like to show you what kind of large scale we're talking about.
The benchmarks we use are silicon carbide supercells; the largest one has 2,742 atoms, and that's roughly 10,000 electrons. Some of the parameters you can see, highlighted in red, are really mind-bogglingly large. If we were not able to scale this up very efficiently, the runtime would be unimaginable; with this optimization, with this implementation, I'll show you later what we have actually managed to do.
So, the communication scheme. Here we're talking about really large-scale matrix multiplications.
If we have a matrix M like this, kind of fat and short, and multiply it by its transpose or conjugate transpose, we are trying to get this small chi matrix. For example, if we have four ranks, we would have four different copies of a quarter of this chi matrix, each rank calculating one copy, and then we accumulate all the copies together to get the final copy.
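In other words, each block of chi is essentially an M times M-conjugate-transpose product. A minimal cuBLAS sketch of one such block (my illustration with assumed names, not the BerkeleyGW source):

```cpp
#include <cublas_v2.h>
#include <cuComplex.h>

// Form one block chi = M * M^H with cuBLAS. M is "short and fat"
// (nrow small, ncol huge, column-major on the device), so the result
// is a small nrow x nrow block that each rank computes locally.
void chi_block(cublasHandle_t h,
               const cuDoubleComplex* dM,  // device, nrow x ncol
               cuDoubleComplex* dChi,      // device, nrow x nrow
               int nrow, int ncol)
{
    const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);
    // C = A * B^H, contracting over the long ncol dimension.
    cublasZgemm(h, CUBLAS_OP_N, CUBLAS_OP_C,
                nrow, nrow, ncol,
                &one, dM, nrow, dM, nrow,
                &zero, dChi, nrow);
}
```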
The conventional way of doing this is really based on MPI collectives: each rank calculates its own copy, and then we reduce down to one copy after the calculation is done, and this goes on until we get to the last portion of the chi matrix.
There's still roughly the same amount of computation, the same number of blocks like this, in the new scheme. But if we look at a larger scale, this one is a really point-to-point-based communication, whereas the other is reduction and collective based, and how that reduction is implemented, whether it's efficient, could be another question. With this non-blocking point-to-point communication scheme, we were able to reduce the amount of communication we do and hide that communication behind the computation.
The pattern of how these different copies move around the network is a bit different than in the other scheme; this is why we call it a cyclic communication scheme. For example, for this particular quarter of chi, rank 1 would be calculating a copy, the blue copy, and this would be merged with the green copy as it moves along the network, and then it goes on, and then back to rank zero.
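A minimal sketch of that cyclic, non-blocking idea (my reconstruction with assumed names, not the actual BerkeleyGW code): each partial block travels around a ring of ranks, each rank merges in its local contribution, and the messages are posted with non-blocking calls so the next block's computation can proceed while they are in flight.

```cpp
#include <mpi.h>
#include <vector>

// On entry, acc holds this rank's local contribution to one chi block;
// after size-1 ring steps, every rank holds the fully merged block.
void cyclic_accumulate(std::vector<double>& acc, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    const int next = (rank + 1) % size;
    const int prev = (rank + size - 1) % size;
    const int n = (int)acc.size();
    std::vector<double> local(acc);   // keep our own contribution
    std::vector<double> incoming(n);

    for (int step = 0; step < size - 1; ++step) {
        MPI_Request reqs[2];
        // Post the communication first so it runs in the background ...
        MPI_Irecv(incoming.data(), n, MPI_DOUBLE, prev, 0, comm, &reqs[0]);
        MPI_Isend(acc.data(),      n, MPI_DOUBLE, next, 0, comm, &reqs[1]);
        // ... in the real scheme, the GEMM for the next block would run
        // here, hiding the communication behind computation.
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        // Merge our contribution into the block passing through.
        for (int i = 0; i < n; ++i)
            incoming[i] += local[i];
        acc.swap(incoming);
    }
}
```

This is essentially a ring-style reduction, which is why the total computation stays the same while the collective is replaced by point-to-point messages.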
So here rank zero would have the final result for this quarter. This has proven very helpful for performance, especially at small and medium scale. At an extremely large scale, we did notice that the overlap between the communication and the computation may not be as effective as we thought it would be, or at least as it was at the smaller scale, but for the most part it has been pretty effective and has been providing a lot of performance improvement. For the larger scale, we are still investigating what other things we could do to make sure the hiding is still effective. Anyway, I will show you some results later, including the scaling curves, and you can see the difference as the scale changes.
A
So
the
other
point
I'd
like
to
touch
on
is
the
reductions
in
sigma
gpp.
So
what
this
kernel
is
doing
is
basically
this
calculation
here
the
different
circles
represent
different
dimensions
that
we
have
to
collapse
over.
All these matrices are very large, and at the end of this kernel we're really trying to get a very small array, sometimes a three-by-one array.
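A minimal sketch of that kind of collapse (hypothetical code with made-up stand-ins for the actual terms, not the Sigma GPP source): many large dimensions are flattened and reduced down to a tiny output array.

```cpp
#include <cuda_runtime.h>

// Grid-stride reduction that collapses a huge (nband x ng) index space
// into a 3-element output, standing in for the "three-by-one array".
// Requires compute capability 6.0+ for double-precision atomicAdd;
// out must be zeroed before launch.
__global__ void collapse_reduce(const double* __restrict__ m,
                                long nband, long ng,
                                double* __restrict__ out /* size 3 */)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0;
    const long total = nband * ng;
    for (long idx = blockIdx.x * (long)blockDim.x + threadIdx.x;
         idx < total;
         idx += (long)gridDim.x * blockDim.x) {
        const double v = m[idx];
        s0 += v;          // stand-ins for the three energy-dependent
        s1 += v * v;      // terms accumulated in the real kernel
        s2 += v * v * v;
    }
    atomicAdd(&out[0], s0);
    atomicAdd(&out[1], s1);
    atomicAdd(&out[2], s2);
}
```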
[Session chair] Charlene, I think we're coming up close to the end of time. We have about one minute.
All right. So some of the optimizations we did include moving the kernels from a bandwidth-bound region to a compute-bound region, replacing some of the instructions, and also removing some of the excessive branching. We will talk about this at the upcoming hackathon in more detail; if you're interested, you can take a look.
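To illustrate the flavor of those instruction and branching optimizations (hypothetical device code, not taken from BerkeleyGW):

```cpp
// Before: a slow divide plus a data-dependent branch in the hot loop,
// which can diverge across a warp.
__device__ double term_before(double num, double den, double cutoff)
{
    if (den > cutoff)
        return num / den;
    return 0.0;
}

// After: multiply by a reciprocal hoisted out of the loop, and a 0/1
// mask instead of the branch (assumes den > 0 so rden stays finite).
__device__ double term_after(double num, double den,
                             double rden /* precomputed 1.0/den */,
                             double cutoff)
{
    const double mask = (den > cutoff) ? 1.0 : 0.0;  // predicated select
    return mask * num * rden;
}
```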
But in terms of results, we have a very good speedup compared to the CPU implementation for both Epsilon and Sigma.
Up to 20,000 GPUs, both Epsilon and Sigma scale very well; it's just the last few data points that we're still investigating. But overall it's been pretty impressive work, and I would say kudos to the whole team for getting all of this done. With that, I'd like to stop, and I'd like to acknowledge all the resources we have used. Thanks.