From YouTube: Massively Parallel MD Simulations using NAMD
David Hardy (NAMD)
Okay, so I'll talk about some of our work that has been run on Perlmutter, and also some of the development efforts that have gone into NAMD to utilize these systems' GPU nodes effectively.
So NAMD is a scalable molecular dynamics program; it's getting close to 30 years old now. The emphasis in NAMD has traditionally been on parallel scaling of large systems, but I'll talk today about work that has brought it back to running very fast simulations on single GPUs or single-node GPU-dense architectures.
The first project I'd like to highlight was a Gordon Bell finalist in 2021, with Anda Trifan at Argonne, and the PI on this was Arvind Ramanathan from Argonne. This was an interesting study of the replication-transcription complex that is responsible for replicating and transcribing the viral mRNA inside a human cell. It involved multiple levels of modeling: we started with cryo-EM data, and then we used an intermediate continuum model, called fluctuating finite element analysis (FFEA), to try to resolve these rather low-resolution cryo-EM images into something that we could then run with all-atom molecular dynamics.
And in order to drive this and make it even faster, we used an AI-steering workflow. I'm just going to take you through the slide transitions here real fast: the idea was that we could use this to more quickly resolve images on the FFEA side, and then use these to feed back into the entire modeling process. On the all-atom side we were using our GPU-resident version of NAMD, which we have since released as NAMD 3.
What was interesting about this is that it was a workflow joining multiple compute sites together, including Perlmutter, Theta and ThetaGPU, and also a special AI testbed at Argonne. The idea was to have an asynchronous workflow, driven by AI, to steer the choice of simulations so as to sample the more interesting and underrepresented areas of the entire conformational space.
When we ran this live, we were using compute nodes of Perlmutter together with compute nodes of ThetaGPU, and they both have similar architectures: Perlmutter has four NVLink-connected A100 GPUs per node, while ThetaGPU is assembled out of DGX A100s, so those are eight A100 GPUs per node connected by an NVSwitch.
The other interesting thing I'd like to call out is the scaling performance we got out of the GPUs versus what we could get out of a traditional CPU-based system.
This plot shows the crossover point in performance for a 1.1-million-atom system. I've indicated a horizontal line showing the performance on TACC Frontera at 128 nodes, and we see that three GPUs give performance equivalent to 128 nodes of a very fast CPU-based system. That's significant.
A
There's
a
a
lot
of
computational
Power
in
in
these
dense
GPU
supercomputers
that
you
know
we
you
know,
are
continuing
to
develop
our
code
to
to
unlock.
A second project that's been running on Perlmutter is being pursued by Aaron Chan and Melih Sener. This is studying what they call the most abundant photosynthetic organism on Earth, Prochlorococcus. Here the idea is to model the rate-limiting steps in this energy-producing system, and to reach the time scales that they need.
My understanding is that the setup of the all-atom data here was done with NAMD, but then they switched over to a coarse-grained Martini force field representation. They're using GROMACS, which implements this really well, and they're running just a single copy of the system per GPU, so they've got an ensemble of these systems running together in parallel.
They can run this at 40 nanoseconds per day, for aggregate time scales of 25 microseconds, so this is a lot of sampling that they were able to do by switching over to this representation.
So now I'd like to talk about the development work that we've done to make NAMD really fast on GPUs. Of course, I'll start by introducing molecular dynamics: we're integrating Newton's equations of motion, so we have to do the time steps sequentially, but there is a lot of calculation needed within each step to compute the forces, especially the non-bonded forces.
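(To make the step structure concrete, here is a minimal velocity Verlet sketch in CUDA; it is an illustration, not NAMD's actual integrator, and the kernel and array names are assumptions.)

    // One half-kick-plus-drift of velocity Verlet (illustrative sketch).
    // f[] holds forces for the current positions, m[] the per-atom
    // masses, dt the time step. Forces must be recomputed before the
    // second half-kick, which is why the steps are sequential.
    __global__ void half_kick_and_drift(int n, float dt,
                                        const float3 *f, const float *m,
                                        float3 *v, float3 *x) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float s = 0.5f * dt / m[i];
        v[i].x += s * f[i].x;  v[i].y += s * f[i].y;  v[i].z += s * f[i].z;
        x[i].x += dt * v[i].x; x[i].y += dt * v[i].y; x[i].z += dt * v[i].z;
    }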
And so let's look at the distributed parallel workflow of NAMD, and take a look at how GPUs are introduced.
We've been incorporating GPUs since CUDA was first released, back in 2007. Initially GPUs weren't nearly as capable as they are today, and we started with just calculating the non-bonded work on the GPU; eventually we also ended up calculating the bonded computes and the scalable parts of the PME calculation, the charge-spreading and force-interpolation kernels.
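(As a flavor of what charge spreading involves, here is a hedged sketch; real PME deposits each charge with B-spline weights over a neighborhood of grid points, which is simplified below to nearest-grid-point deposition, and all names are assumptions.)

    // Simplified charge-spreading kernel (illustrative only).
    // atoms[i] packs position (x,y,z) and charge (w); grid is an
    // nx*ny*nz real-space charge mesh with spacings hx, hy, hz.
    __global__ void spread_charges(int n, const float4 *atoms,
                                   int nx, int ny, int nz,
                                   float hx, float hy, float hz,
                                   float *grid) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float4 a = atoms[i];
        int gx = ((int)floorf(a.x / hx) % nx + nx) % nx;  // periodic wrap
        int gy = ((int)floorf(a.y / hy) % ny + ny) % ny;
        int gz = ((int)floorf(a.z / hz) % nz + nz) % nz;
        // Many atoms can land on the same cell, so deposit atomically.
        atomicAdd(&grid[(gx * ny + gy) * nz + gz], a.w);
    }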
In addition, we've got a mechanism in NAMD so that, if you're running on only a single GPU or a single node, we can calculate the entire PME workload on a single GPU.
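(Keeping PME on one device means the 3-D FFTs run there as well; a minimal cuFFT sketch of the reciprocal-space part might look like the following, where NX, NY, NZ, d_grid, and d_grid_k are assumed names.)

    #include <cufft.h>
    // Reciprocal-space PME on a single GPU (illustrative sketch).
    // d_grid: real NX*NY*NZ charge grid in device memory.
    // d_grid_k: cufftComplex buffer of NX*NY*(NZ/2+1) elements.
    cufftHandle fwd, inv;
    cufftPlan3d(&fwd, NX, NY, NZ, CUFFT_R2C);
    cufftPlan3d(&inv, NX, NY, NZ, CUFFT_C2R);
    cufftExecR2C(fwd, d_grid, d_grid_k);  // forward 3-D transform
    // ... scale d_grid_k by the Ewald influence function here ...
    cufftExecC2R(inv, d_grid_k, d_grid);  // back to real space for forces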
But when we're looking at the parallel calculation, the original idea was the GPU-offload scheme, where we partition work between the CPU and the GPU: the force calculations are done on the GPU, and the remaining parts on the CPU include the integrator, the rigid bond constraints, and possibly whatever enhanced sampling methods you might be using.
This approach worked really well up until somewhere around the release of the Pascal- and Volta-generation GPUs, when we found that NAMD's GPU-offload approach was becoming increasingly CPU-bound: the work remaining on the CPU was becoming a bottleneck.
Schematically, the idea in GPU offload is that you launch a force kernel to calculate the forces, and we had a mechanism in NAMD to write the results back very quickly to host-pinned memory; meanwhile the CPU is busy-waiting to see if the forces are done, and as soon as they are, it can start integrating the next time step with those forces.
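(Schematically, that offload pattern can be sketched as below; compute_forces and integrate_on_cpu are hypothetical stand-ins, not NAMD functions.)

    // GPU-offload pattern (illustrative sketch): forces stream back to
    // page-locked host memory while the CPU polls for completion.
    float3 *h_forces;
    cudaHostAlloc((void **)&h_forces, n * sizeof(float3),
                  cudaHostAllocDefault);          // pinned host buffer
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    compute_forces<<<blocks, threads, 0, stream>>>(n, d_x, d_f);
    cudaMemcpyAsync(h_forces, d_f, n * sizeof(float3),
                    cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(done, stream);

    // CPU busy-waits, then integrates the next step with these forces.
    while (cudaEventQuery(done) == cudaErrorNotReady) { /* spin */ }
    integrate_on_cpu(n, h_forces, x, v);  // integrator, constraints, ...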
That's how we were getting overlap between the CPU and the GPU. But as GPUs became faster and more capable, we were again seeing significant gaps in the effective utilization of the GPU, and so our approach was to develop a GPU-resident version of NAMD.
Now the atomic coordinate data all lives on the GPU between time steps, and we've basically moved all of the atom integration work and the related operations onto the GPU, so you end up with very little CPU work being done at this point.
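(A hedged sketch of the GPU-resident pattern, with hypothetical kernel and helper names: coordinates, velocities, and forces stay in device memory across steps, and the host is touched only occasionally for output.)

    // GPU-resident pattern (illustrative sketch): no per-step
    // device-to-host traffic; the CPU just launches kernels.
    for (long step = 0; step < nsteps; ++step) {
        compute_forces<<<blocks, threads, 0, stream>>>(n, d_x, d_f);
        integrate<<<blocks, threads, 0, stream>>>(n, dt, d_f, d_m, d_v, d_x);
        if (step % output_period == 0) {
            cudaMemcpyAsync(h_x, d_x, n * sizeof(float3),
                            cudaMemcpyDeviceToHost, stream);
            cudaStreamSynchronize(stream);  // rare host touch, output only
            write_trajectory_frame(h_x, n);
        }
    }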
In the profiles of the GPU-resident version we can see that we're using the GPUs quite effectively, and our timings bore this out: in some cases it was more than doubling the performance of the GPU-offload version.
You also have that kind of fast GPU connectivity, to some extent, on Summit, and also in the Frontier computer. Here the idea is that we're now decomposing the entire problem across several GPUs, but there's going to be some communication required between them. Within each time step we have to do a position multicast, to populate the compute objects that are calculated on their respective GPUs, and then force reductions back to the GPUs that are holding the patches. This means the GPUs need load/store memory access between the different devices within every time step.
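(To illustrate that load/store access between devices: with CUDA peer access enabled over NVLink/NVSwitch, a kernel on one GPU can dereference a buffer that lives on another GPU; gather_remote_positions is a hypothetical kernel name.)

    // Enable direct peer access from GPU 0 to GPU 1 (illustrative).
    int ok = 0;
    cudaDeviceCanAccessPeer(&ok, /*device*/ 0, /*peerDevice*/ 1);
    if (ok) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // GPU 0 maps GPU 1's memory
    }
    cudaSetDevice(0);
    // d_x_gpu1 was allocated on GPU 1; the kernel on GPU 0 reads it
    // directly, which is what the per-step position multicast and
    // force reductions rely on, without staging through the host.
    gather_remote_positions<<<blocks, threads>>>(n_remote, d_x_gpu1, d_x_local);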
Okay, so that's why we need a system with these fast NVLink connections between the GPUs. Since the original version of this, we've done some work to improve the performance and scaling, mitigating some of the work that's still left on the CPU; that includes the so-called atom migration step, which updates the domain decomposition.
A
Also,
we
find
that
pme,
you
know,
can
cause
quite
a
scaling
bottleneck
as
well,
and
so
we
have
some
some
things
in
place
to
mitigate
that
that
those
issues
from
pme
and
so
here's
a
performance
plot
now
with
a
a
one
million
atom
Benchmark
system
that
we,
you
know
regularly
use
to.
You,
know
benchmark
performance
of
the
MD
and,
and
here
we're
showing
that
you
know
the
versus
the
original
version
were
scaling.
A
Quite
a
bit
better,
for
you
know,
say
like
a
full
dgx
a100
again
promoter
is
effectively.
You
know,
half
the
a
note
of
promoters,
effectively:
half
the
the
computational
power
of
of
dgx
a100,
because
there's
there's
just
the
four
nodes
and-
and
we
were
already
scaling
quite
well
out
to
four
nodes,
but
now
we've
these
these
optimizations
help
to
improve
performance
as
well
and
so
just
to
take
a
look
at
a
larger
size
system.
This was a SARS-CoV-2 coronavirus spike protein system of eight and a half million atoms, and we see that we're doing quite well; the new code is faster, shown here running out to 64 nodes of Frontier for this particular simulation.
Frontier's AMD GPUs are programmed through HIP, which has a very close correspondence to CUDA. It turns out it was a little bit of work, but we can basically have a hipified version of our code, based on our original CUDA kernels, with just a little bit of extra preprocessor macro magic; otherwise we can leave the code path generally unchanged.
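(A minimal sketch of that kind of preprocessor shim, assuming a hypothetical NAMD_HIP build flag; this is illustrative, not NAMD's actual header.)

    // Map CUDA runtime names onto HIP when building for AMD GPUs,
    // so the same kernel source compiles for both vendors.
    #ifdef NAMD_HIP
      #include <hip/hip_runtime.h>
      #define cudaMalloc      hipMalloc
      #define cudaMemcpyAsync hipMemcpyAsync
      #define cudaStream_t    hipStream_t
      #define cudaSetDevice   hipSetDevice
    #else
      #include <cuda_runtime.h>
    #endif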
Aurora, on the other hand, is going to be based on Intel GPUs, and so our approach there is to implement this in SYCL. We do have some automatic translation tools that can take our CUDA kernels and turn them into SYCL code, but it's not something that we can really support right now within the same GPU code paths, so this is effectively giving us yet another GPU-accelerated code path through NAMD.
That's something we're going to have to deal with eventually; but anyway, it's been a longer process, and it hasn't been as easy to port NAMD to Aurora.
At the moment we do have all of our GPU-offload kernels ported, and we still need to port the GPU-resident parts of NAMD. Okay, so I'd like to end by acknowledging particularly the various people, either here at the University or partners at different companies, who have had their hands on the GPU parts of NAMD.