From YouTube: GPU Based Simulations with QMCPack
Description
Ye Luo (Argonne)
GPU Based Simulations with QMCPack
Hi everyone, I'm Ye Luo from Argonne National Laboratory. I work in the Computational Science Division and also at the Leadership Computing Facility. I'm happy to share with you my GPU experience with QMCPACK.
All this work is supported by the Exascale Computing Project; QMCPACK is part of its application development effort.
All those exascale systems actually share some commonality: they are all hybrid, GPU-accelerated architectures. When we design QMCPACK, we take this into account. On the other hand, we don't want to leave CPU users behind, so we try to address all the portability issues and also make sure the code performs well on CPUs.
That's why the system size you can solve, in terms of number of electrons, is relatively low. On the other side of the spectrum, the accuracy kind of drops, because those methods are more empirical and less involved with first-principles calculations, while the system size can be extremely large because of the cheap computational cost. QMC is actually in the middle: slightly more expensive than Density Functional Theory, which a lot of people, for example NERSC users, are familiar with.
As you can see, however, it is cheaper than typical quantum chemistry methods because of its appealing scaling, going as the third to the fourth power depending on what problem you are solving. Another advantage of quantum Monte Carlo is that it scales very well on a massive number of nodes in a supercomputer. In the past we could scale up to close to one million CPU cores; node counts have actually dropped on recent machines, but you can multiply the per-node factor on top of that.
So what problems does QMC solve? We are mostly aiming at materials, such as solids, which I did some studies on in the past, and we also did simulations of molecules. This example is actually not a tiny molecule: it's a metal-organic framework, a very humongous molecule with very complex structure.
All these things are of interest for science, and we want to tackle them with QMC. As we move from petascale to exascale systems, we have more compute power, which means we can either solve existing, typically petascale problems faster, or we get a chance to solve much larger problems, even ten times the number of electrons. Our aim is around 1,000 atoms and 10,000 electrons. That's the overall effort we are putting in to make those simulations happen for science.
Okay, so QMCPACK implements quantum Monte Carlo algorithms. It's a modern, high-performance, open-source simulation code. All the development is on GitHub, so you can see all the history, the discussions, the issues, and how people are thinking, and we welcome everyone to talk to us. The code treats, as I mentioned, solids, 2D systems, and molecules; it can be used on a wide variety of materials of interest for both physics and chemistry.
The whole code is in C++, and we adopt an MPI+X scheme, where X in the beginning was OpenMP or CUDA; right now it's even extended, so we are kind of combining both. If you look at the three words in "quantum Monte Carlo", you can figure out some patterns of parallelism. First, it's Monte Carlo, so you have massive numbers of Markov chains. The Markov chains can be parallelized, and this suits very well those supercomputers with a lot of nodes. This provides the high-level concurrency.
Second, we solve quantum problems; at that scale we deal with the interactions between particles, like the electron-electron interaction and the electron-ion interaction, and those add another level of concurrency we can use. Historically this worked very well on SIMD architectures on CPUs. Apart from explicitly coded kernels, QMCPACK also heavily relies on linear algebra libraries from vendors, plus LAPACK. Some additional libraries we use are HDF5 for I/O and FFTW for initialization; they are not performance-critical, I would say, but they are very important as well.
So I explained that "quantum" gives you concurrency and "Monte Carlo" gives you concurrency. Overall the algorithm evolves in this way; to give you a schematic of how it runs: you start with a bunch of electron configurations and you evolve them. We call each configuration a walker, and as a whole they form a population. Over time the particles do random walks, so the walkers evolve very independently.
So it's very good for parallelism. At the end you have to evaluate something to decide the weight of each walker, and you do load balancing. Overall the code weak-scales very easily, except for this load balancing, which you can't avoid if you intend to keep efficiency. So now you have a basic idea of how the parallelization happens inside QMCPACK.
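The walker/population idea described above can be sketched in a few lines of C++. This is a toy illustration, not QMCPACK's actual code: the `Walker` struct, the 1D coordinates, and the function names are all hypothetical, but the key property is real — each walker advances independently, which is what makes the method so parallel.

```cpp
#include <random>
#include <vector>

// Toy walker: one electron configuration plus its statistical weight.
struct Walker {
  std::vector<double> positions; // one coordinate per electron (toy model)
  double weight = 1.0;
};

// Advance every walker independently by one diffusion step.
// Walkers never talk to each other here, which is why this loop
// parallelizes so well across threads, GPUs, and MPI ranks.
inline void advance_population(std::vector<Walker>& population,
                               double step, unsigned seed) {
  std::mt19937 rng(seed);
  std::normal_distribution<double> gauss(0.0, step);
  for (auto& w : population)
    for (auto& x : w.positions)
      x += gauss(rng);
}

// After the independent moves, evaluate a population-wide quantity; a real
// QMC code would follow this with the load-balancing step mentioned above.
inline double total_weight(const std::vector<Walker>& population) {
  double sum = 0.0;
  for (const auto& w : population) sum += w.weight;
  return sum;
}
```

In an actual diffusion Monte Carlo run the weights would also drive branching (replicating or killing walkers), which is exactly where the load-balancing cost comes from.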
Now let's talk about the GPU porting; first a word before going into the details. As I mentioned, the walker parallelism is very scalable. In the past, on machines like Mira, which had a very efficient network, we could hit 99% weak-scaling efficiency, and on machines like Titan and Summit, whose networks are typically less performant I would say, QMCPACK could still retain 95% scaling efficiency. So no worry about cross-node communication.
The GPU porting then focuses only on a single node. To do GPU porting, you need to understand the parallelism in your code and map it to the hardware. In QMCPACK we designed a way to map walkers to different threads. But remember that we are doing Monte Carlo, and although at a high level Monte Carlo looks easily parallelizable, it has another challenge: its divergent behavior. You propose moves, and some moves get accepted.
Some moves get rejected, so the different walkers don't advance at the same pace, and you have to find a way to mitigate those penalties, by grouping all the accepts together or grouping all the rejects together; that kind of optimization. At a lower level, the electron-electron interaction gives data parallelism. This typically works very well; however, it easily hits the limit of the compute power of a single CPU core, so we have to develop ways to go beyond one core.
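The grouping trick just mentioned can be sketched as follows. This is a minimal illustration, not QMCPACK's implementation: given the accept/reject outcome for a batch of walkers, we partition the walker indices so that each outcome is processed as its own uniform batch, instead of branching per walker inside a GPU kernel (which would diverge).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container: indices of walkers whose proposed move was
// accepted vs. rejected, so each group can be updated in one uniform batch.
struct AcceptRejectBatches {
  std::vector<std::size_t> accepted;
  std::vector<std::size_t> rejected;
};

// Partition a batch of per-walker outcomes into the two uniform groups.
inline AcceptRejectBatches
group_by_outcome(const std::vector<bool>& accept_flags) {
  AcceptRejectBatches out;
  for (std::size_t i = 0; i < accept_flags.size(); ++i)
    (accept_flags[i] ? out.accepted : out.rejected).push_back(i);
  return out;
}
```

The "accepted" kernel (apply the move, update matrices) and the "rejected" kernel (restore state) then each run branch-free over their own index list.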
So when designing a performant code for GPUs, you have to factor in all the parallelism patterns and map them properly to your device, and you also need to care about what target problem you're solving. We wrote all those details up in a paper that is already out or coming out soon; there you can find how we designed QMCPACK's parallelism, how we mapped all the details, and how we achieved good efficiency across CPU and GPU.
The technology we chose for targeting GPUs is OpenMP offload, the "target" feature, because it is supposed to be portable across vendor GPUs. There is also a nice way to fall back on the CPU: you can offload to the CPU and do a lot of development, like correctness checks, that way. Those added benefits make it very developer-friendly.
You know, for QMCPACK we actually track not just performance: we track many compilers, both open-source ones and vendor-provided ones, but it's a lengthy process to make sure they improve in quality and meet the needs of QMCPACK. Later you will see that I mostly use Clang 15, which is the best option for QMCPACK right now; on the other hand, GCC is also in good shape, at least passing all the correctness checks.
AOMP is good for AMD but still lacks a bit of the optimization we'd like to see. With NVHPC we have to work around certain features that the compiler refuses to provide. OneAPI is very close to Clang and still needs improvement; the correctness checks are good, but performance tuning is still being worked on. So OpenMP is an important component in the software stack used by QMCPACK. At a higher level, QMCPACK also relies on CPU threads talking to the GPUs independently; that actually adds additional challenges for the compiler and runtime developers to ensure thread safety.
In addition to that, as I said, QMCPACK relies on linear algebra, so on each platform we have to talk to the corresponding vendor linear algebra libraries, such as cuBLAS and Intel MKL. We try to minimize the source code we have, and we rely on C++ templates to handle the real and complex cases and the full- and mixed-precision cases. We are very happy that C++ is evolving at a very stable pace, and we now rely on C++17.
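The single-template-source idea can be illustrated with a trivial kernel (the function is hypothetical, not from QMCPACK): one routine is instantiated for real and complex value types, and analogously for float versus double precision, instead of maintaining four hand-written copies.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// One generic dot product; T can be double, float,
// std::complex<double>, or std::complex<float>.
template <typename T>
T dot(const std::vector<T>& a, const std::vector<T>& b) {
  T sum{}; // value-initialized: 0 for reals, 0+0i for complex
  for (std::size_t i = 0; i < a.size(); ++i)
    sum += a[i] * b[i];
  return sum;
}
```

The same source then serves all four value-type/precision combinations, and each instantiation can forward to the matching vendor BLAS routine in a real code.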
Now, some performance numbers. This is actually a many-years effort, I would say. In each cluster of bars, the leftmost green bar is the CPU-only case. Then we started to port to GPUs, and you see in the light blue bars that initially the performance was really bad, because our 1.0 reference is QMCPACK's legacy CUDA implementation. That legacy GPU implementation, written explicitly in CUDA, is performant, but it's not maintainable, and certain design choices are not flexible enough to be portable. That's why we rewrote the GPU code using the OpenMP offload plus library approach.
Initially the performance was way behind the target we'd like to hit, but over time, as we ported more and worked with the Clang community to improve the performance of the compiler, that helped us a lot to pass the performance check, even exceeding the CUDA code in certain cases. The newly written code is also more feature-complete, which is very friendly to users; the legacy GPU implementation with CUDA is not feature-complete, and users frequently get stuck. And we keep improving the performance.
I think we are at the switching point, pushing users to use the new code right now.
We did most of the development on Summit, but right now, at NERSC and on Polaris, we have the A100 GPUs, so we were all curious how the code does with the newer generation of GPUs. On this figure you will see that, yes, there's no doubt about the GPU acceleration: you should use the GPU, and CPU-only you have a huge loss. But with the latest A100 GPU there's an additional benefit.
It's not only because of more compute power, but also because the memory gets larger. When QMCPACK uses orbitals represented as splines, they take a lot of memory space, and the remaining space is also needed for all the wavefunction components, where you store matrices. When you have a larger memory capacity, more data can be resident on the GPU, and thus the GPU is kept busier. So there's a huge benefit.
On the A100s, as the problem size grows large, we don't get bottlenecked by memory like on the V100, and you can see at the largest point, the 256-atom problem, the performance is close to 3x over the V100. So the A100 is more efficient and also has a larger memory space, which is very beneficial for the simulation; that's the advantage of the A100. Somehow this pattern is a bit like machine-learning applications, where larger memory really helps.
Okay, here are the lessons learned so far. We carefully assessed how CPU and GPU work efficiently, mapped the code's existing concurrency to those levels of parallelism, and maximized the efficiency of the hardware. We adopted our OpenMP-plus-vendor-library strategy: compared to the old legacy GPU implementation with CUDA, which needed around 100 kernels explicitly written in CUDA, we have now brought that number down to about 10 kernels plus about 10 offload regions.
So when we switch to a different hardware and software stack, like AMD's, we need to take care of the CUDA kernels, but all the OpenMP offloads are fully portable. So it's very maintainable, and with very decent performance, I would say. For the last part, I'll share my own opinion about GPU porting; hopefully you can get some insights. First, I think data movement is the key of GPU porting. It's the top priority.
It's very similar to the past experience. In the old days, when we only programmed with MPI, you knew that the interconnect was your bottleneck, and you tried to avoid moving data across the interconnect. Later, with CPUs, if you look at the details, you know that DDR is slower than the cache, so we tried to implement cache-friendly algorithms and keep the data resident in cache. Right now, for GPU porting, it's similar.
Data locality is your top priority when you design your algorithm. I personally think you should avoid any programming style that ignores the cost of moving data. Although some software technologies tell you that you can rely on unified memory and ask the runtime to move the data for you, I always say you'd better put in the effort to understand how data should be moved and keep explicit control. That's what I can recommend to most of you who are not software engineers.
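The "explicit control over data movement" advice can be sketched like this (an illustrative example, not QMCPACK code): an enclosing `target data` region keeps the array resident on the device across many kernel launches, paying one host-device transfer for the whole loop instead of letting a unified-memory runtime shuttle it every step. Without offload support the pragmas are ignored and the code runs identically on the host.

```cpp
#include <vector>

// Apply a simple update kernel to v many times, keeping v device-resident.
inline void relax(std::vector<double>& v, int steps) {
  double* p = v.data();
  const int n = static_cast<int>(v.size());
  // One host<->device round trip for the whole loop, not one per step.
  #pragma omp target data map(tofrom : p[0:n])
  for (int s = 0; s < steps; ++s) {
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; ++i)
      p[i] *= 0.5; // stand-in for a real update kernel
  }
}
```

The inner `target` regions find the data already present on the device, so the steps run back to back with no transfers in between.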
A
Another
point
is:
if
you
study
or
you
follow
those
GPU
lessons,
they
typically
tell
you
how
to
do
kernel
programming,
but
that's
not
the
key
I
would
say:
that's
really
secondary.
You
should
first
settle
down
how
the
data
been
moved
back
and
forth.
That's
more
of
your
focus.
A
Yeah
second
part,
is
you
need
to
understand
the
parallelism
of
your
code
and
parallelism
inside
the
hardware
you
are
using
and
only
when
you
map
them
very
well,
so
at
least
you
need
two
levels
on
symbols
on
CPU
and
GPU.
If
you
map
them
well,
you
will
likely
get
a
very
portable
code
across
CPN
GPU
and
never
worried
about
further
details
of
the
hardware.
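A sketch of mapping two levels of concurrency, in the spirit of the walker/particle split described earlier (names and the flat array layout are illustrative): the outer walker loop goes to GPU thread blocks via `teams distribute`, and the inner particle loop to the threads within a block via `parallel for`. On a CPU the same structure maps naturally to cores and SIMD lanes.

```cpp
// coords is a flattened [nwalkers][nparticles] array of coordinates.
inline void scale_all_walkers(double* coords, int nwalkers,
                              int nparticles, double factor) {
  // Outer level: one team (thread block) per walker.
  #pragma omp target teams distribute \
      map(tofrom : coords[0:nwalkers * nparticles])
  for (int w = 0; w < nwalkers; ++w) {
    // Inner level: team threads share the particles of one walker.
    #pragma omp parallel for
    for (int p = 0; p < nparticles; ++p)
      coords[w * nparticles + p] *= factor; // stand-in per-particle update
  }
}
```

Because the two loop levels mirror the hardware's two parallelism levels rather than any vendor-specific detail, the same source stays efficient across CPU and GPU.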
I think for scientists, CUDA is not the best choice.
From my experience, when I do OpenMP offload development, I spend most of the time just offloading to the CPU and getting the numbers right; then I go to a GPU machine to settle the final fine details about data movement and all those things, and to see if I made mistakes. So if you really need CUDA, restrict it to serving library calls: GEMM, linear algebra, all those things, try to rely on libraries. It's not a scientist's job to optimize those portions; leave them to the libraries.
People might struggle to choose between OpenACC and OpenMP. I think they are competitors and siblings, and there are pros and cons. I would say: assess your situation, your needs, and where you run the simulation; choose one, and use that one to restructure and refactor your code into the best shape. Moving to the other one afterwards should not be that difficult; it's not as simple as just switching pragmas, there's a bit more to it, but both should bring your code in the same direction.
That direction is what is helpful for GPU porting. Both programming models give you the capability of doing reductions, whereas with raw CUDA you have to do it on your own; that's one more reason it's appealing to use OpenACC and OpenMP.
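As an illustration of that built-in reduction support (a hypothetical function, not from QMCPACK): one `reduction` clause asks the runtime for a cross-thread sum, where low-level CUDA would require a hand-coded shared-memory reduction tree.

```cpp
// Sum an array of per-walker energies on the device (or host fallback);
// the reduction clause handles the cross-thread combination safely.
inline double sum_energies(const double* e, int n) {
  double total = 0.0;
  #pragma omp target teams distribute parallel for \
      map(to : e[0:n]) reduction(+ : total)
  for (int i = 0; i < n; ++i)
    total += e[i];
  return total;
}
```

OpenACC offers the equivalent with its own `reduction` clause, which is part of why both directive models are attractive to scientists.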
So, to conclude: I told you how we redesigned QMCPACK and enabled a new performance-portable implementation. Although this is our second time doing the GPU porting, we learned a lot, and we find that we are doing much better compared to the past.