From YouTube: Intro to Perlmutter and GPUs
Description
Intro to Perlmutter and GPUs
Presenter: Jack Deslippe, Application Performance Group Lead, NERSC
Training: Migrating from Cori to Perlmutter, March 10, 2023
Thank you, everybody, for joining. I'm Jack Deslippe. I lead the Application Performance Group here at NERSC, and I'm going to give you an overview of the Perlmutter system itself, some of the capabilities of the GPUs, and some of what we've been doing working with users and what we've learned from that process, just to give you a bit of a kickoff to the day on the potential of Perlmutter. So I want to start with this overview slide.
Where does Perlmutter sit in our roadmap? Back in 2013 we deployed what was really the last of a multi-decade trend of HPC systems dominated by server-class CPUs in a massively parallel, distributed system. We then procured, for the first time, a novel architecture based on the Intel Xeon Phi technology, and with Perlmutter we're continuing that transition with our first ever GPU-accelerated system at NERSC. This is part of the roadmap that should lead to our first exascale-class system in the 2025-2026 time frame.
Here is a picture of the Perlmutter system configuration. There are two types of nodes in the system (really three types, but two big categories). One category is the GPU-accelerated nodes, and in addition there are still some CPU-only nodes as well. A lot of the capability of the system comes from the GPU-accelerated nodes, the NVIDIA Ampere GPU nodes.

Each of those has four GPUs and one AMD Milan CPU. Each GPU has 40 gigabytes of HBM, there is traditional DRAM on the node as well, and each node has four connections to the interconnect, that is, four individual network interface cards, or NICs. For the CPU-only nodes, we have two CPUs per node, with one connection to the interconnect on each of those nodes. I'll give you a few more details on these as we go through the slides.
Overall, the system has the following specifications. There are the GPU nodes, and one thing I skipped on the last slide is that there are actually two different types: the majority have 40 gigabytes of HBM per GPU, but we additionally have 256 nodes with a slightly higher-end SKU of the GPU that has 80 gigabytes of high bandwidth memory, or HBM, per GPU. There are also the roughly 3,000 CPU-only nodes. In terms of performance, you can see that quite a lot of it comes from the GPU nodes, in terms of the floating-point operations available on the system.

One thing I will highlight about the CPU-only partition, though, is that it is close in capability to Cori: the whole Cori system is of similar capability to the CPU nodes of Perlmutter.
This is a diagram of what I was describing earlier. Here's what a GPU node looks like: you have the four A100 GPUs and one AMD Milan CPU, connected to four network interface cards, or NICs, that connect the node to the overall Perlmutter network.

One thing to also highlight is that the A100s themselves are connected to each other via NVLink, so there's a very high-speed network between the four GPUs on the node described here. The GPUs are connected to the node, that is, to the CPU, via a PCIe 4.0 connection, and one of the things that of course makes the GPUs very capable is their high bandwidth memory: each of these cards has at least 40 gigabytes of it.
I mentioned that we have 256 additional nodes with 80 gigabytes of high bandwidth memory. It's called high bandwidth memory because the bandwidth between the memory and the GPU's registers and compute units is very high, over 1,500 gigabytes per second, which I'm highlighting right here. Compare that with the corresponding bandwidth on the CPU, about 200 gigabytes per second to CPU memory.

So each of these GPUs is a very capable processing unit: each is capable of up to about 10 teraflops at the 64-bit, or double precision, level. But if you are able to use the tensor cores, which perform matrix-matrix multiply type operations, you can get about double that performance out of the A100 GPUs. Over here I have a diagram of the CPU nodes, where we have two AMD Milan processors per node, with 64 cores per CPU.
Each of these AMD Milans has 64 cores, which is convenient because it's similar in spirit to the KNL nodes we have on Cori. It supports up to the AVX2 instruction set, which is actually half the vector width of the KNLs on Cori, but the CPU cores themselves are very capable, quite a lot faster than each of the KNL cores you'd find on Cori. In addition, as I said, it supports up to about 200 gigabytes per second of memory bandwidth, and you can see a few other stats for each of the CPUs below.
One of the exciting aspects of Perlmutter, I think, is that we have an all-flash file system. The scratch file system is all flash, about 35 petabytes of disk space. It has an aggregate bandwidth of over five terabytes per second, and because it's flash it performs really well in terms of IOPS, about 4 million IOPS. Underneath the file system there are roughly 4,000 NVMe SSDs powering that performance. So unlike Cori, where you have a disk-based scratch and then a flash-based burst buffer layer, on Perlmutter the story is a bit simplified.
You just have one scratch file system, and it's all flash.

One of the things I want to chat about and present to you all today is that we really have a common challenge together with the user community, which is to enable this diverse community of scientific codes to run efficiently on an advanced architecture like Perlmutter, starting with the Cori transition and continuing as we look towards exascale. You can see a little bit of the difference between Perlmutter and Cori, as we continue this transition, in the table here.
Obviously the peak performance of the system has gone up quite a bit, and we're seeing increased capability along a number of different avenues. The overall system memory has gone up, of course, but one of the most significant differences is the per-node performance number, which has gone up quite significantly, due largely to the fact that we have these GPU-powered nodes with the A100s.

One of the things we looked at as we began the process of deploying Perlmutter is what fraction of the workload was really ready for GPUs. This was the situation back in the 2017-2018 time frame, when we looked at the major codes at NERSC that were using the most hours and how ready they were for GPUs. This is the categorization we used, and it's one of the reasons we ended up with a system that has both GPU-accelerated nodes and some CPU-only nodes.
We found that, while a large fraction of our workload had been enabled for GPUs, or could be enabled, parts of the workload were not yet optimized for them. That motivated us to start an effort to really help our users and partner with the user community to increase the number of applications that were optimized and enabled for the GPUs.

What I'm going to tell you is a little about what we did there, what we've learned, and what we hope we can continue doing with you all. We started a program called NESAP, which stands for the NERSC Exascale Science Applications Program. Part of the motivation was that there is significant work, and there are real differences, that have to be taken into account as you consider optimizing your application for a GPU. I'm going to talk about a few of those differences over the next few slides.
One of the biggest differences is simply the amount of parallelism that's required. If you look at a typical CPU node from Cori using the Intel Haswell architecture, there are 64 cores per node, and with the Intel Hyper-Threading technology you can have two threads active at a time. It supports the AVX2 instruction set, so at the CPU level you can issue two 256-bit vectors, which is four wide if you're talking about double precision numbers. If you do all the math, you end up with roughly 2,000-way parallelism that's really required to keep one of those CPU nodes on Cori fully occupied, fully busy, every cycle.
If you compare this with the GPUs, the A100s on Perlmutter, you have the equivalent of 108 SMs. SM stands for streaming multiprocessor, which is not quite the same thing as a CPU core, but we can make a rough comparison to a core on Cori. Each of those SMs can support up to 64 warps at a time; only a couple are actually executing at any instant, but you really want to oversubscribe to keep the GPU busy, so you can have up to 64 resident. And one of the biggest differences is that each of those warps operates on 32 lanes at a time. So if you do the math, you end up with 108 times 64 times 32, which is roughly 200,000-way parallelism.

That's a big leap from the 2,000-way parallelism of the Haswell CPU node we were just talking about. The bullet here just puts what I was saying into words: you typically want to oversubscribe the GPU to keep it busy, and that's one of the reasons you end up needing the increased parallelism.
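As a quick back-of-the-envelope restatement of those numbers (this is just the product of the figures quoted above, not an exact occupancy model):

$$108\ \text{SMs} \times 64\ \text{warps/SM} \times 32\ \text{lanes/warp} = 221{,}184 \approx 2\times10^{5}\text{-way parallelism},$$

which is roughly two orders of magnitude more concurrency than the approximately 2,000-way figure quoted for a Cori Haswell node.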
Another concept that's really important for understanding the performance of the GPU is memory bandwidth. On the Cori Haswell nodes you have 128 gigabytes of traditional DDR DRAM, and it's capable of about 128 gigabytes per second, which basically means it can bring on the order of 100 gigabytes of data per second from memory into the CPU, or rather into the CPU's registers, to do computing.

On the GPU nodes of Perlmutter, considering a single A100 GPU, we have 40 gigabytes of HBM on most of the nodes (some nodes have 80 gigabytes), and you have about 1,500 gigabytes per second of memory bandwidth, over an order of magnitude higher than those Haswell CPUs on Cori. That gives you a lot of capability. But one of the complications is that the connection between the CPU and the GPU is relatively slow: it's provided by this PCI Express bus here, at about 32 gigabytes per second, much smaller than the bandwidth available from the GPU memory to the GPU compute units. So one of the challenges is that you want to move data back and forth between the CPU and the GPU as infrequently as possible.
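To make that concrete, here is a minimal, hedged sketch in C++ using OpenMP target offload (one of the programming models discussed later in this talk); the array names and sizes are made up for illustration. The data is mapped onto the GPU once, two kernels run against it there, and the result crosses the PCIe bus back to the host only at the end.

```cpp
#include <cstdio>
#include <cstdlib>

int main() {
    const int n = 1 << 20;
    double* x = (double*)std::malloc(n * sizeof(double));
    double* y = (double*)std::malloc(n * sizeof(double));
    for (int i = 0; i < n; ++i) { x[i] = 1.0; y[i] = 2.0; }

    // Map x and y onto the GPU once for this whole region, instead of
    // shipping them over the PCIe bus before and after every kernel.
    #pragma omp target data map(tofrom: x[0:n]) map(to: y[0:n])
    {
        // Kernel 1: runs entirely out of the GPU's HBM.
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i)
            x[i] = 2.0 * x[i] + y[i];

        // Kernel 2: reuses data already resident on the device;
        // nothing moves over PCIe between the two kernels.
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i)
            x[i] = x[i] * x[i];
    } // x is copied back to the host only here.

    std::printf("x[0] = %f\n", x[0]);
    std::free(x);
    std::free(y);
    return 0;
}
```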
So one of the challenges of optimizing for the GPU is that there are multiple optimization avenues you typically want to pursue. I've highlighted two of them so far: one is expressing more parallelism in your application; a second is making use of the very fast memory on the GPU, while recognizing that moving data between the GPU and CPU is not fast. Then there are other, higher-order considerations. For example, every time you launch a piece of work, or kernel, on the GPU there can be some overhead, so in general you want to make each kernel as long and as significant as possible; sometimes that means fusing shorter kernels together. And even though the high bandwidth memory on the GPUs is fast, there's still an opportunity to use things like the registers, the cache, or shared memory on the GPU to get even faster performance.
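To illustrate the kernel-fusion point, here is another small, hedged sketch in the same OpenMP-offload style (the loop bodies are invented for illustration and not taken from any real code): two back-to-back elementwise kernels are combined into a single offload region, so the launch overhead and the extra round trip through HBM are paid once instead of twice.

```cpp
#include <cstdio>

int main() {
    const int n = 1 << 20;
    static double a[1 << 20], b[1 << 20];
    for (int i = 0; i < n; ++i) { a[i] = 1.0; b[i] = 2.0; }

    // Unfused version (for contrast), written as two separate offload loops:
    //   kernel 1: a[i] = a[i] + b[i];   kernel 2: a[i] = a[i] * a[i];
    // That costs two kernel launches, and the intermediate sum is written
    // to and re-read from HBM between them.

    // Fused version: one launch, one pass over memory, and the intermediate
    // value lives in a register instead of HBM.
    #pragma omp target teams distribute parallel for \
        map(tofrom: a[0:n]) map(to: b[0:n])
    for (int i = 0; i < n; ++i) {
        double t = a[i] + b[i];
        a[i] = t * t;
    }

    std::printf("a[0] = %f\n", a[0]);
    return 0;
}
```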
We realized that this is a challenging set of activities for the community, so one of the things we worked on with NVIDIA is providing tools that give you information about your code that is actually actionable and can help tell you which of these optimization directions to move in. NVIDIA's Nsight is a fairly new performance tool that includes some additional functionality based on our conversations and relationship with them. One of the new features it provides is a roofline analysis capability: it can tell you where your application falls on a roofline plot, which I'm showing here. The plot characterizes your application in terms of its data movement versus its overall performance, measured against the hardware ceilings, and it can suggest in which directions you might look to optimize your application.
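For reference, the standard roofline model behind a plot like that bounds attainable performance by arithmetic intensity (flops performed per byte moved to or from memory), the memory-bandwidth ceiling, and the compute peak:

$$P_{\text{attainable}} = \min\left(P_{\text{peak}},\; I \times B\right), \qquad I = \frac{\text{flops}}{\text{bytes moved}},$$

where $B$ is the relevant bandwidth ceiling (for example, the roughly 1,500 GB/s HBM figure quoted earlier). A kernel sitting well below its roofline usually has headroom along one of the optimization directions just discussed.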
Overall, what we've done with the NESAP program is partner with a set of application development teams, along with our vendor partners like NVIDIA and Cray/HPE, to prepare their codes for Perlmutter at a pretty deep level, and what we'd like to do now is share those lessons learned with the greater NERSC community, all of you attending this training, and let you know about opportunities to continue working with us. We selected about 25 applications across the simulation, data, and learning spaces, and it was an all-hands-on-deck activity; here are some of the staff members who participated.
A number of them will be available to chat throughout the day and are participating in the hands-on sessions. One of the things I really want to highlight to you all today is an opportunity that is still ongoing. One of the most fruitful activities we engaged in as part of the NESAP program is hackathons, and we've had two different kinds: one was private hackathons that were part of our contract to deliver Perlmutter, and the second is completely public hackathons that you can find for yourself at this website, gpuhackathons.org.
These hackathons occur almost monthly, all over the country, and are really led by NVIDIA itself, and NERSC, I think, has provided more team members than any other institution to these worldwide events over the last couple of years. We are hosting one ourselves later this year; you can find the information at gpuhackathons.org, and they are open to anyone with an application they want help with on GPU optimization. I think this lets us reach NERSC teams that are all around the country, really all around the world, and helps amplify the NESAP program.

So if there is one takeaway I want you to take from my talk today, it is to go check out that link, gpuhackathons.org, and consider participating yourselves in one of the upcoming hackathons. We're going to have a hands-on session later today, and maybe that will give you a taste of what you can accomplish in a deeper-dive event like one of these hackathons.
Okay, so I want to talk about some of what it's possible to accomplish. Through our partnerships with different applications, partly at these hackathon events, we worked with a number of code teams on improving their performance on GPUs, and one of those code teams was LAMMPS.

LAMMPS is a classical molecular dynamics code with a focus on materials modeling, and there's a production version of LAMMPS that uses Kokkos, which is how they access the GPU. If you don't know what Kokkos is, it's a performance-portability framework for GPUs. The code had already been optimized somewhat through relationships with the vendors, but what the team found by going to these hackathons in particular, and looking at the kernels that were their computational bottlenecks, is that there was an opportunity to rewrite those kernels for the GPUs and get additional performance.

In particular, you can see here the speedup over time, I think starting from a value of one and progressing as they worked on optimizing the application for Perlmutter. You can see that the hackathons are where they made significant progress very quickly.
Those near-vertical jumps are the hackathons, and they ended up finding a lot of performance that could be gained by tuning the code for the GPU architecture. This actually led to some really nice scientific results that were finalists for the Gordon Bell Prize at the Supercomputing conference and represent the largest molecular dynamics simulations to date at this level of fidelity, up to 20-billion-atom calculations. One of the things you can see here is the performance in terms of million atom-steps per second, and the ideal would be a flat curve on this plot as you scale up the problem size and the number of GPUs used at the same time. They're comparing the performance on the Summit computer at Oak Ridge National Lab, on Perlmutter, and on an internal NVIDIA cluster, and the difference between these two curves comes essentially from the difference in GPU generations.

This has been part of a series of large-scale calculations coming out of the NESAP program that have been highlighted at Supercomputing as part of the Gordon Bell Prize series, and we're happy to say that one of our NESAP teams, including one of the NERSC staff members, was a winner of the prize in 2022.
That was using the WarpX code, and you can see that other NESAP applications have been able to accomplish some great large-scale science outcomes as well. In terms of overall optimization, I would say the good news is that we've seen many applications succeed in preparing for Perlmutter, and we think what we've learned can be applied to other applications too. One of the ways we'd like to keep engaging with the broader NERSC community is through training events like today's; but also, if you think you could benefit from a deeper dive with our staff and with experts from NVIDIA at your side, we really encourage everyone to join these community hackathons at gpuhackathons.org. There are events all over the country in the upcoming six months to a year that you can consider joining. One of the things I highlighted is that there are multiple kinds of GPU optimizations, and so something we really emphasize at these events is profiling and analyzing your application to determine which optimization paths are likely to be the most profitable for it.

I'm going to switch gears a little over the next few slides and talk about the capability of the system itself and some of its configuration.
One of the things I want to highlight is that Perlmutter supports essentially every GPU programming model out there. If you've been following what's going on at Oak Ridge and Argonne, they're also deploying really large-scale GPU systems, powered by AMD GPUs or Intel GPUs, and those tend to support some subset of this chart. But because Perlmutter is based on perhaps the longest-standing GPU computing vendor, the NERSC community has existing GPU applications that may be built with CUDA, OpenACC, or even CUDA Fortran, and we really want to meet you where you are: those are all enabled on Perlmutter.

We also recognize that performance portability, the ability to run at NERSC and at other DOE facilities and other HPC facilities around the world, is important to our community, so one of the things we worked on with NVIDIA is making sure there's a performance-portable path forward using OpenMP, which we highlighted a lot on Cori as a way to get the most performance out of the system.
In addition, if you are working on porting your applications to those systems at Oak Ridge and Argonne, we do support DPC++ execution on the system, as well as HIP; I think you can see on this chart that HIP is supported, along with very popular C++ frameworks like Kokkos, for getting the most performance portably out of the system from C++ applications.

I'm going to talk a little about Jupyter in a second, and a lot of the same debuggers and profilers you may be used to from past systems are available here, including DDT and CrayPat. And NVIDIA, as I highlighted earlier, provides a very useful set of GPU profiling tools based on the Nsight profiling package.
Okay, one of the things I wanted to quickly highlight is that Jupyter is available on the system, and you can access it via the NERSC JupyterHub, much as you would on Cori; this is part of trying to make the system as usable as possible for all of you. Another part is making sure that existing applications we know the community uses are pre-installed and optimized for the GPUs. For example, if you're a VASP user, we have GPU-optimized VASP installed and ready to go, as long as you have a VASP license. We've also put together a lot of information on using Perlmutter in our docs pages.
So if you haven't already, make sure you check out the Perlmutter docs at docs.nersc.gov. I've talked a little about the tools, and I just want to highlight once again how valuable I think these hackathons can be. If you're someone who is working on developing an application and wants to make sure it runs well on Perlmutter, I think the GPU hackathons are a really great opportunity to work with us, and with the vendor, NVIDIA, for example, to make sure your code is running as well as it possibly can.

This work is essentially ready to be used by anyone: you can now program the GPUs using OpenMP as your programming model of choice, using the NVIDIA compilers to compile and run your application.
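As a hedged illustration of what that looks like end to end, here is a minimal OpenMP GPU-offload program in C++; the compile command in the comment reflects the NVIDIA HPC SDK compilers in general rather than a specific NERSC recipe, so check docs.nersc.gov for the exact recommended modules and flags.

```cpp
// Build (illustrative, NVIDIA HPC SDK): nvc++ -mp=gpu saxpy_omp.cpp
#include <cstdio>
#include <omp.h>

int main() {
    const int n = 1 << 20;
    static float x[1 << 20], y[1 << 20];
    const float a = 2.0f;
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // The loop below is offloaded to a GPU; the map clauses handle the
    // host-to-device and device-to-host data movement.
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];

    std::printf("GPU devices visible to OpenMP: %d, y[0] = %f\n",
                omp_get_num_devices(), y[0]);
    return 0;
}
```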
Okay, so I want to close with a few science examples. The one I want to start with is near and dear to my heart; it's an application I've worked on for many years. In this particular case there's a defect in a pure silicon crystal, at the center of the diagram here, that gives rise to localized quantum states centered on that defect. These are challenging calculations because, in order to study a defect, you have to have a large sample of the material to capture those complex states, and so these are some of the largest such calculations done to date. I want to talk a little about the performance here.

We're comparing the performance of V100s, like those on the Summit system at Oak Ridge, with the A100s we now have here at NERSC, and you can see significant performance advantages in time to solution (lower is better here) as you move to the newer GPU architecture, along with the ability of the application to scale to nearly the full Perlmutter system.

Another science example comes from ExaBiome, in the bioinformatics space. ExaBiome is an application that deals with metagenomics, that is, resolving and identifying organisms from a sample taken from a real-world environment, where you end up with many different microbes present at the same time.
This is a challenging problem to GPU-accelerate, but the team ended up making quite a lot of progress and has written and deployed the world's fastest GPU aligner, targeting specifically the A100 GPUs, and here is some of the performance increase. Looking at time to solution again (lower is better), running on the CPUs of Perlmutter versus the GPUs, you can see a significant improvement, even up to very large node counts, from using the GPUs of the system.

Here's an example from computational fluid dynamics, or CFD, and it represents a trend we're beginning to see in our user applications: a combination of traditional simulation and modeling with AI, or deep learning, methods. What they've done here is put together a traditional computational fluid dynamics solver, where you time-step through a fluid dynamics simulation, but in the middle they use a neural network to upscale, or get a super-resolution view of, the fluid. The idea is to get the fidelity, or accuracy, of a finer-mesh calculation for the cost of a reduced-order, or coarser-grained, calculation.
In particular, the GPUs are used both at simulation time and at training time, to train that neural network, and the GPUs are orders of magnitude faster at that process than the CPUs. Even compared to the previous-generation GPUs, the V100s on our test bed on Cori and on Summit, the newer GPUs are two or three times faster at that training step.

I'll give maybe one more science example here, a molecular dynamics calculation that a team put together to run on Perlmutter. What I think is one of the really exciting aspects of this calculation is that the team realized they could use lower floating-point precision, a mix of 16-bit and 32-bit precision, to run on Perlmutter, and by using that lower precision they were able to break the exaflop barrier, exceeding 1.1 exaflops at the reduced precision while running at essentially full scale, or near full scale, on the GPUs. This was a COVID-specific application with 83 million atoms, taking advantage of the tensor core capability of the processors.

I'm going to skip ahead a little, talk about one last set of applications, and then close the presentation.
These come from experimental facilities: the LCLS at Stanford, electron microscopy, and high energy physics, like the LZ and DESI experiments in the high energy physics and astrophysics domains. All of these applications are now up and running on Perlmutter and producing science results they really couldn't have achieved without the scale and capability of the Perlmutter system. To highlight one in particular: if you look at the DESI project, they are seeing over a 2.5x improvement in per-node throughput.

I'll close with these science stories, and I hope some of them have been inspiring about what you can do with Perlmutter as well. I just want to say that we are really excited to see what you all do with Perlmutter over the life of the system; we're really excited about the science and the discovery that's going to be done with it during its lifetime. I think I will stop sharing now. Just a reminder to put any questions you have into the Google Doc, and we will reply there.