From YouTube: GPUs 101
Description
Part of the Using Perlmutter Training, Jan 5-7, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/using-perlmutter-training-jan2022/
Really excited to see all the people who are attending here. I think the point of my talk is to tee up the rest of the day. I just want to give you an introduction to Perlmutter, why we're using GPUs, some of the really high-level GPU technology features and top-level considerations that you need to think about as you're moving your application over to GPUs, and also to end with some of the progress that we've been making with the applications we've been working with on porting or optimizing their codes for GPU systems.
As we deployed Cori, I think we started this transition toward exascale, energy-efficient architectures. Cori was powered by the many-core Intel Knights Landing Xeon Phi processors, and with NERSC-9 we have deployed our first CPU-GPU accelerated HPC architecture. You can see that NERSC-10 is expected to be our first exascale-class system, arriving in the 2024, maybe 2025, time frame. In terms of the greater DOE HPC ecosystem within the Office of Science,
you can see that GPUs are really beginning to play an important role. With Summit at the Oak Ridge Leadership Computing Facility, there's already a CPU-plus-GPU system powered by NVIDIA Volta GPUs.
So why is this happening? I think the short answer is:
it allows us to deliver more capability, in terms of overall flops, memory bandwidth, and operations per second, for less power. Here's an example of an application running on Edison versus Summit. On the y-axis you have time, on the x-axis you have power, and time times power is energy, so as you move down the diagonal you're improving on energy efficiency. What you're seeing is the same application running on Edison, the traditional multi-core CPU architecture, versus Summit at Oak Ridge, the GPU architecture, and you can see that it's essentially an order-of-magnitude improvement that you're getting by using the accelerators.
And so, as I mentioned, this change has arrived, and it's really driven by power consumption toward these lightweight cores. I think we found that Cori, using the many-core architecture, has been a boon because of these new capabilities, and this continues with the GPU architectures on Perlmutter.
One of the things I want to highlight in this slide is the main concepts that you need to think about when programming for GPUs, in particular the A100 GPUs. I'm going to start with two main concepts that, if you grasp them, give you most of what you need to understand about GPUs, and we'll talk about a couple of other concepts as well.
Well, I guess this comparison might have been for KNL, but for Haswell you're comparing 32 cores to what you might consider the 108 SMs on a GPU socket, so 108 of what we call streaming multiprocessors.
Two can really be active at a time, but you can actually oversubscribe them to get additional levels of performance. To get the most out of a Haswell CPU, you have to think about using vector operations instead of scalar operations; there are two 256-bit-wide vector units, so that would be like four double-precision operations per instruction, times two. Whereas on a GPU you're really thinking about 32 SIMT threads per warp, and you can think of these as a little bit similar.
A
There
are
some
important
differences
between
what,
like
you,
can
do
in
a
vector
instruction
on
the
cpu
versus
what
these
kind
of
simply
threads
can
do
on
a
gpu.
But
you
really
should
be
kind
of
thinking
about
these
as
32
operations
that
are
kind
of
working
on
different
data,
but
doing
the
same
instruction
every
every
cycle,
and
so,
if
you
think,
if
you're
talking
about
double
precision,
then
you're
really
getting
around.
2,000-way parallelism (64 times 4 times 8) versus something like 200,000-way parallelism on a GPU. So there is significantly more parallelism, by orders of magnitude, that is necessary to really keep a GPU busy. One of the things you can think about on the CPU are the hyperthreads that help you hide latency, or hide different kinds of waiting time or stalls on the processor, and that's similar on the GPUs,
where you really want to oversubscribe the GPUs, with either more warps per SM or more streams, to really help hide any latency that your application might be seeing.
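To make that concrete, here is a minimal CUDA sketch of my own (not from the talk; the kernel and sizes are just placeholders) showing the kind of parallelism being described: a trivial element-wise kernel launched with tens of thousands of blocks, far more warps than the 108 SMs can execute at once, which is exactly the oversubscription that hides latency.

```cuda
// A minimal sketch (not from the talk): exposing far more parallelism than
// the GPU has compute units, which is what keeps the 108 SMs busy.
#include <cstdio>
#include <cuda_runtime.h>

// All 32 threads in a warp execute this same instruction stream on
// different elements: the SIMT model described above.
__global__ void scale(const double *x, double *y, double a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i];
}

int main() {
    const int n = 1 << 24;                  // ~16M elements (placeholder size)
    double *x, *y;
    cudaMallocManaged(&x, n * sizeof(double));
    cudaMallocManaged(&y, n * sizeof(double));
    for (int i = 0; i < n; ++i) x[i] = 1.0;

    int block = 256;                        // 8 warps per block
    int grid  = (n + block - 1) / block;    // ~65,000 blocks: many resident
                                            // warps per SM, hiding latency
    scale<<<grid, block>>>(x, y, 2.0, n);
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```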
A
So
that
that's
number
one,
you
can
see
that
the
the
amount
of
parallelism
has
gone
up
by
like
one,
not
just
one
order
of
magnitude,
but
really
like
two
or
three
orders
of
magnitude
when
moving
from
the
kind
of
cpu
architecture
to
the
gpu
architecture
and
the
the
next
main
concept
I
want
to
talk
about
is
that
the
the
gpu
memory
is
very
fast
and
your
application
can
really
take
advantage
of
that.
If you look at the same kind of comparison between the Haswell CPU that exists on the Haswell nodes of Cori and the GPU, you have a total of 128 gigabytes of DDR on the Haswell node, whereas on a single GPU on Perlmutter you have 40 gigabytes of HBM, or high-bandwidth memory. On Haswell, the bandwidth that you can get from that memory is 128 gigabytes per second.
On the other hand, what is really slow, much slower than both of those memories, is the PCI Express bandwidth for transferring data back and forth. You can see that's an order of magnitude less than the memory bandwidths that we've been talking about, so it's the slowest pipe, the slowest data transfer speed, on the node, and we want to avoid moving data back and forth across it frequently.
So for your application to get the best performance out of the GPU, you really want to keep the data on the GPU and get the most out of that really high-bandwidth memory, and avoid moving it back and forth frequently between the CPU and the GPU, which is really the slow part.
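As a rough illustration of that point, here is a minimal sketch of my own (not from the talk; the `step` kernel is a stand-in for real work): the data crosses PCIe once on the way in, every kernel in the loop works out of the GPU's HBM, and the result comes back only at the end.

```cuda
// A minimal sketch (not from the talk): keep the data resident in the GPU's
// HBM and run many kernels on it, instead of crossing PCIe every step.
#include <cuda_runtime.h>

// Placeholder kernel standing in for real work on the resident data.
__global__ void step(double *u, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) u[i] += 1.0;
}

void run(double *host_u, int n, int nsteps) {
    double *u;
    size_t bytes = n * sizeof(double);
    cudaMalloc(&u, bytes);
    cudaMemcpy(u, host_u, bytes, cudaMemcpyHostToDevice);    // one copy in

    for (int s = 0; s < nsteps; ++s) {
        // Every iteration works out of high-bandwidth memory; no PCIe
        // traffic inside the loop.
        step<<<(n + 255) / 256, 256>>>(u, n);
    }

    cudaMemcpy(host_u, u, bytes, cudaMemcpyDeviceToHost);    // one copy out
    cudaFree(u);
}
```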
So those are really the top two concepts for programming a GPU, and I think everything else is sort of a second-order consideration. If you have lots of parallelism, and you realize that getting the data onto the GPU and trying to keep it there is important, then you're like 90% of the way there in terms of GPU performance.
There are a number of second-order considerations, so let me talk about just a few of those right now. One is that when you're defining your kernels (and I think you'll hear about this in some of the upcoming presentations), there is some overhead in launching each of those kernels.
So you don't want those kernels to be really super short. Some techniques that you can use are things like fusing short kernels together so they have longer execution times, and there's the possibility of doing things like defining CUDA graphs if you do have a lot of really short kernels that depend on each other or need to execute in a certain order.
You can tell the GPU about them ahead of time by defining something like a graph, and that can help eliminate some of the overhead of launching these individual kernels.
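As a hedged sketch of what that can look like with the CUDA runtime API (my example, not one from the talk; the two kernels are placeholders), you can capture a short, ordered sequence of kernels from a stream into a graph once and then replay the whole sequence with a single launch per step:

```cuda
// A minimal sketch (not from the talk): capture a sequence of short,
// dependent kernels into a CUDA graph so they can be replayed with one
// launch, instead of paying per-kernel launch overhead every step.
#include <cuda_runtime.h>

__global__ void k1(float *x, int n) { int i = blockIdx.x*blockDim.x+threadIdx.x; if (i < n) x[i] *= 2.0f; }
__global__ void k2(float *x, int n) { int i = blockIdx.x*blockDim.x+threadIdx.x; if (i < n) x[i] += 1.0f; }

void run_with_graph(float *d_x, int n, int nsteps) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Record the short kernels, and their ordering, once...
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    k1<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
    k2<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // ...then replay the whole sequence with a single launch per step.
    for (int s = 0; s < nsteps; ++s)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}
```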
I mentioned that the high-bandwidth memory is fast, but just like on a CPU, it's actually better to keep your data in registers, in cache, or in what on the A100 is called shared memory, to keep the data even closer to the compute units when possible. A lot of our applications at NERSC tend to depend more, in terms of their performance, on the ability to move data quickly rather than on just computing
as many flops as possible, and so keeping the data as close to the compute units as possible can help. In particular, for many applications we find that it's important to experiment and find an optimal balance between maximizing the parallelism, which I really highlighted in the first point, and minimizing the amount of spilling of data out of the registers.
The GPU has a fixed number of registers, and in some sense, the more parallelism you expose, or the more warps you have active, the more likely it is that your data might spill from the registers, so there is some experimentation that's often necessary at that last level of optimizing your application to find that optimal balance.
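Here is a small sketch of my own (not from the talk) of what those knobs look like in CUDA: a kernel that stages its working set in shared memory, with `__launch_bounds__` used as one way to trade registers per thread against the number of resident warps. The specific numbers are illustrative and are exactly the kind of thing you would experiment with.

```cuda
// A small sketch (not from the talk): stage a block's working set in shared
// memory so reuse is served on-chip rather than from HBM, and use
// __launch_bounds__ as one knob in the occupancy-versus-spilling trade-off.
#include <cuda_runtime.h>

// Asking for at least 4 resident blocks of 256 threads per SM caps how many
// registers the compiler may use per thread. More resident warps hide
// latency, but too few registers forces spills to local memory, so the
// right numbers usually come from experimentation.
__global__ void __launch_bounds__(256, 4)
smooth3(const float *in, float *out, int n) {
    __shared__ float tile[256 + 2];              // data plus a one-element halo
    int i   = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;

    tile[lid] = (i < n) ? in[i] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (i > 0) ? in[i - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[blockDim.x + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;
    __syncthreads();

    if (i < n)
        out[i] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}
```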
Okay, so I feel like if you get number one and number two, if you have those in your head, then you understand the most important things about programming for a GPU. Then there are a number of second-order considerations; these are really just two big examples of them, but these are the type of things that will help you tune and really get the absolute best performance that you can out of a GPU system. In terms of which programming models you use to express these concepts:
we realized that a lot of our users had codes that already had GPU implementations, maybe using CUDA or OpenACC, and we wanted to make sure that the system supported those well. In addition, we are supporting programming models that are maybe the primary choices for people who are targeting AMD GPUs, like HIP, or the Intel GPUs, like DPC++ and SYCL, and those are enabled and available to execute on the Perlmutter system as well.
Then, in particular, we worked with NVIDIA and what was the PGI group to make sure that OpenMP offloading was supported and worked well in the PGI, now NVIDIA HPC, compilers. This has led to the release of the production OpenMP offload compiler as of basically last April, and it continues to improve with every release of the NVIDIA HPC SDK.
We used hackathons a lot for Cori, and I know other centers out there have used them really effectively. They're effective not just because of the technical things that you learn at a hackathon, but really because of the sociology of them: being surrounded by a lot of people who are doing sort of the same thing that you are is a really contagious environment
A
That
sort
of
takes
you
out
of
your
day
job
for
a
couple
days
to
really
focus
on
your
on
your
application,
with
kind
of
all
the
right
experts.
Looking
over
your
your
shoulders
so
go
check
out,
wwgpu
hackathons.org
nurse
staff,
I
think
provided
more
team
mentors
than
any
other
institution
to
these
worldwide
events
in
2020,
and
it's
really
allowed
us
to
reach
nurse
teams
kind
of
all
around
the
country
and
really
kind
of
all
around
the
world.
A
These
hackathons
are
kind
of
adapted
from
what
was
sort
of
an
in-person
event
to
remote
events,
but
I
think
they've
they've
managed
to
really
to
really
be
very
useful,
very
profitable,
and
I
think
that
even
features
of
this
sort
of
remote
hackathon
format
will
end
up
being
incorporated
into
future
future
hackathons,
even
whenever
we're
kind
of
at
a
new
normal
beyond
this
pandemic
pandemic
state,
but
bottom
line
go
check
out,
gpu
hackathons.org
and
see
if
there's
an
event
coming
up
that
that
you
can
attend.
A
There's
a
number
of
ways
that
we
are
trying
to
take
what
we've
been
learning
working
with
applications.
You
know
partly
at
hackathons,
partly
as
part
of
our
nisap
kind
of
partnership
program
and
expand
and
deliver
that
to
the
community
at
large.
So
one
of
the
things
that
we're
doing
is
really
working
closely
with
the
programming
models
and
languages
team,
and
I
think
again,
you're
going
to
hear
a
lot
about
this
in
the
next
presentation
to
make
sure
that
our
community
needs
are
being
considered
and
adapted
as
the
kind
of
accelerator
programming.
is standardized within the C++ and Fortran standards, and as important frameworks like Kokkos, OpenMP, and OpenACC get developed and expanded.
A
One
of
the
ways
that
you
can
take
advantage
of
promutter,
even
if
you
don't
write
your
own
code-
is
by
utilizing
the
best
possible
the
best
possible
installation
or
version
of
the
community
codes
out
there
that
are
optimized
for
for
promoter.
A
So
there's
a
number
of
applications
that
we
provide
on
promutter,
that
we've
worked
with
the
developers
to
kind
of
improve
their
performance
and
make
sure
that
what
we,
what
we
have
available,
is
really
optimized
for
the
for
the
architecture,
and
so
these
are
just
some
of
the
examples
of
applications
that
that
we
provide
checking
out
the
nurse
documentation.
A
Our
different
training
events
like
like
this
event
today,
I
think,
is
a
great
way
to
to
kind
of
get
the
most
out
of
the
system,
and
then
I'm
going
to
talk
a
little
bit
more
about
the
work
that
we've
been
doing
with
vendor
tools
and
just
one
one
more
pitch,
because
I
think
this
is
really
one
of
the
takeaway
messages
that
I
have
for.
You
is
to
check
out
the
gpu
community
hackathons
and
see
if
there's
one
that
you
could
that
you
could
attend.
A
So
these
are
kind
of
all
virtual
for
the
time
being,
but
will
eventually
also
be
back
in
in
person.
A
So
in
terms
of
wrapping
your
head
around
sort
of
the
the
two
and
the
the
four
kind
of
concepts
I
talked
about
earlier
for
getting
the
most
out
of
gpus,
what
I
think
is
really
important
is
that
you
don't
do
it
in
a
vacuum.
Is
that
you
kind
of
use
some
tools
to
help
you,
and
so
you
know,
as
we're
thinking
about
the
the
optimization
challenge
that
our
teams
have
and
porting
to
and
optimizing
their
applications
for
for
pearl
mudder.
A
You
know,
I
think
that
we
found
that
they
have
kind
of
similar
questions
and
that
really
what
they
need
is
kind
of
an
absolute
sense
of
performance
when
optimizing
applications,
and
they
have
questions
like
how
do
I
know
if
my
performance
is
good
in
some
overall
sense
or
why
am
I
not
getting
the
peak
performance
that
was
advertised
on
the
page,
and
maybe
the
most
important
question
is:
how
do
I
know
when
to
stop
like?
When
is
when
is
my
performance
good
enough
that
it's
not
worth
investing?
A
You
know
another.
Several
months
of
my
time
to
try
to
improve
it-
and
you
know
I've
seen
a
number
of
presentations
I've,
even
given
a
number
of
presentations
where
people
present
a
result.
That
is
something
like
the
following,
like
my
application
is
running
two
times
faster
today
than
it
was
a
year
ago
and
in
some
sense
that's
great.
A
You
know
it's
always
better
when
your
application
is
running
faster
than
it
was
before,
but
in
another
sense
it's
it's
not
entirely
meaningful
because
you
don't
know,
you
know
where
you
stand
in
any
kind
of
absolute
sense.
Like
was
the
code.
You
know
terrible
to
begin
with,
and
now
it's
a
little
a
little
better
or
was
it
already
great
and
you're?
You
know
you
put
in
some
kind
of
like
ninja
hacking
activity
to
to
really
get
it
to
perform
the
the
2x
better.
A
So
I
think,
what's
really
important
is
to
know
where
you're
standing
something
kind
of
some
absolute
sense
that
can
guide
your
your
next
steps
and
in
particular,
as
you
saw
on
the
gpu
there's
many
potential
optimization
directions
that
you
can
take.
So
is
utilizing
the
the
high
bandwidth
memory.
What's
really
important
for
you
or
is
really
getting
the
most
out
of
the
the
the
different
levels
of
parallelism
available
on
the
gpu?
A
What's
really
important,
how
do
you
know
what
is
the
limiting
factor
in
your
app's
performance
and
again
I
think
it's
it's
quite
important
for
productivity
to
know
like
when.
When
is
the
performance
good
enough,
and
when
can
you?
When
can
you
stop,
and
so
what
we
found
is
that
framing
these,
these
conversations
or
like
framing
the
answers
to
the
questions
in
terms
of
a
simple
performance
model
called
the
roofline
model
and
the
gpus
is
a
really
good
way,
a
good
way
to
begin
thinking
about
it.
A
So
the
roof
line
basically
tells
you:
what
are
the
performance
ceilings
on
the
device
based
on
the
characteristics
of
your
application?
So
you
characterize
your
application
on
the
x-axis
in
terms
of
the
floating
point
operations
that
it
does.
You
could
also
think
if
you
don't
have
an
application
that
does
floating
point
operations,
you
could
also
think
of
it
in
terms
of
like
integer
operations
or
just
other
other
type
of
operations.
But you want to think about the number of operations that you're doing per byte of data that you need to transfer from some level of the memory hierarchy to the compute units, and given that your application has that characteristic, you then have these different ceilings that limit your performance, based on your ability to utilize different parts of the architecture. So this is, for example, whether your application really does what are called fused multiply-add operations,
so whether you have an equal number of multiplies and adds in your application. The nice thing is that we worked with NVIDIA to enable this analysis directly within the Nsight performance tool (and I think you're going to hear a little bit about this next week at the training from NVIDIA about the HPC SDK), and so you can give yourself an understanding of where you stand against the potential of the architecture, in an absolute sense, by profiling
A
Your
application,
with
with
insight-
and
you
know
one
thing
I
want
to
highlight
about
this-
is
that
there's
nothing
here
in
this
roofline
model
that
you
know
you
couldn't
find
in
different
profiling
tools
sort
of
already.
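For concreteness, the arithmetic behind a roofline is simple enough to sketch. This is my own illustration, not something from the talk, and the flop counts, byte counts, and peak numbers are placeholder values for an A100-class GPU rather than measured figures; Nsight Compute measures the real ceilings and your kernel's actual flops and bytes for you.

```cuda
// A minimal sketch (not from the talk): the arithmetic behind a roofline.
// All numbers below are illustrative placeholders, not measurements.
#include <cstdio>

int main() {
    // Characteristics of a hypothetical kernel.
    double flops = 2.0e12;            // floating-point operations performed
    double bytes = 4.0e11;            // bytes moved from HBM

    double peak_flops = 9.7e12;       // rough FP64 peak, flop/s (illustrative)
    double peak_bw    = 1.5e12;       // rough HBM bandwidth, bytes/s (illustrative)

    double ai = flops / bytes;                        // arithmetic intensity
    double ceiling = (ai * peak_bw < peak_flops)      // which roof applies?
                   ? ai * peak_bw                     // bandwidth-bound
                   : peak_flops;                      // compute-bound

    printf("arithmetic intensity: %.2f flop/byte\n", ai);
    printf("attainable performance: %.2e flop/s\n", ceiling);
    return 0;
}
```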
Okay, so we've been working pretty closely with a number of applications from a whole bunch of different science areas, working with them to improve their applications for the Perlmutter system. This is a plot of where some of those applications from different algorithmic areas stand, comparing their performance on the Perlmutter system versus the Edison system, in a system-to-system performance comparison. One of the things I just want to highlight is that across all of these different application areas, or algorithmic areas, we are seeing
application performance varying anywhere from about 20x, for some of the toughest apps to optimize on the GPU, all the way up to over 1000x for some of the machine learning applications that are able to take advantage of some of the low-precision acceleration that's available on the GPU as well.
A
I
know
that
I'm
sort
of
running
a
little
bit
low
on
time
here
and
so
I'm
going
to
end
my
talk
by
showing
you
a
few
examples
of
what
what
can
be
done
with
with
with
perlmutter.
So
the
first
application
I
want
to
talk
about
is
desi.
That stands for the Dark Energy Spectroscopic Instrument, and I think this is a particular app to start with because it's related to the namesake of the system, Perlmutter himself, who discovered that not only is the universe expanding, but the rate of expansion is actually accelerating. The term for that is dark energy, and scientists believe that about 70% of the universe is dark energy,
A
Although
we
don't
really
have
have
a
good
understanding
about
about
what
that
is,
and
so
the
desi
instrument
is
going
to
send
nurse
data
every
night
for
five
years,
and
this
data
will
kind
of
be
used
to
construct
a
really
detailed
map
of
the
universe,
to
better
understand
the
nature
of
of
dark
energy,
and
so
they've
been
working
to
accelerate
the
the
kind
of
key
desi
data
analysis
pipeline
on
on
perlmutter.
And
that's
what
you
see
here.
A
So
they
completed
kind
of
a
major
refactor
and
optimized
the
cpu
code,
with
the
first
gpu
port
really
coming
in
early
20
2020.
And
so
that's.
What
you
see
here
is
the
performance
of
the
the
gpu
port.
A
They
continue
to
optimize
the
application
over
a
series
of
kind
of
types
of
optimizations
targeting
different
features
of
the
of
the
gpu
and
the
and
the
very
latest
performance
of
the
application
on
perlmutter
has
a
kind
of
25x
improvement
in
per
node
throughput
using
perlmutter
compared
to
the
to
the
edison
baseline
or
the
initial.
A
The
initial
code
on
on
edison
xfl
so
x
is
an
application
that
uses
hpc
to
analyze
the
data
from
x-ray
free
electron
lasers,
and
I
think
one
of
the
interesting
things
here
is
that
this
is
a
community
who
wants
to
employ
kind
of
hpc
to
enable
real-time
data
analysis
to
make
decisions
and
analyze
their
data,
not
just
after
kind
of
their
beam
time
is
over
or
the
data
collection
time
is
over,
but
really
during
the
experiment
itself
or
during
their
data
collection
time
at
a
at
one
of
the
facilities
that
provides
these
x-ray
free
electron
lasers.
A
So
they
really
used
the
gpu
systems
at
nurse
to
develop
now
a
highly
scalable
application
that
analyzes
these
x-ray
diffraction
patterns
with
the
runtime.
That's
really
improved
by
many
orders
of
magnitude,
so
something
that
would
take
12.
You
know.
12
hours
on
edison
is
now
on
the
order
of
two
minutes
on
a
a
perlmutter
node,
and
you
can
see
that
they're
there
they've
also
been
working
a
lot
on
the
scaling
across
across
gpu
nodes
on
the
on
the
system.
A
Another
example
is
lamp,
so
lamps
does
molecular
dynamics.
Calculations
are
basically
molecules
kind
of
interacting
with
each
other
atoms
atoms
and
molecules
interacting
with
each
other
and.
moving around, or evolving, over time. This was an application that had an existing GPU version, but they've been working with NERSC, and in particular I want to highlight the effect of the hackathons that they attended in 2019 and 2020 to improve their performance. Through those efforts, largely centered around hackathons as well as working with some of the NVIDIA engineers,
A
They've
been
able
to
get
a
22x
improvement
in
their
gpu
performance
to
the
point
where
their
node
versus
node
speed
up
with
a
new
application
on
perlmutter,
compared
to
kind
of
where
they
were
at
on
edison.
To
begin
with
is
over
250
50x.
A
So
that's
a
combination
of
using
the
new
system,
as
well
as
the
improvements
that
they
put
into
the
into
the
application,
and
after
this
effort
they
were
really
able
to
achieve
some.
Some
pretty
impressive
runs
on
perlmutter
and
other
gpu
systems
that
they
couldn't
have
really
done
without
all
of
the
gpu
acceleration
and
improvements
that
they
made
to
the
to
the
code,
and
this
was
recognized
as
a
gordon
bell,
finalist
in
sc
back
in
back
in
november.
the Supercomputing conference, back in November. I think one of the interesting things that they were able to measure is the performance in terms of atom-steps, or millions of atom-steps per GPU per second, and this sort of takes out the scale factor. What you'd expect,
or hope, to see here would be a straight line across this graph: no matter how many GPUs you use, you get roughly the same performance in terms of atom-steps per GPU. You can see that they've used three different systems: Perlmutter, Summit, and Selene. Perlmutter and Selene both have the latest-generation A100 GPUs, whereas Summit at Oak Ridge was using the previous-generation V100 GPUs, and you can see that they're getting great performance on Perlmutter, and roughly a...
I think I'm running a little bit low on time, because I'm supposed to be finishing and maybe taking some questions here, so I'll go through some of these quickly. But this is another comparison between the previous-generation V100 GPUs from Summit and the A100 GPUs from Perlmutter, and again you're seeing about a factor of 1.6, or in some cases close to a factor of two, in performance improvements.
Let me just end with this last science story here: accelerating some fluid dynamics applications with GANs on Perlmutter. I know a number of people in the audience are probably hoping to do some machine learning, either training or inference, on the Perlmutter system. This is a case where a group is replacing part of a fluid dynamics simulation with a GAN, a kind of trained neural network.
So you can apply your machine learning workflows to Perlmutter as well. Okay, so let me conclude. I think the key takeaway is that NERSC as a whole has been successful in preparing a number of applications for Perlmutter.
One of the things I want to highlight to this audience is that we really want to continue working with you to enable the productive use of Perlmutter, and I think really one of the best ways of doing that is to have you all apply for and join these public hackathon events. I guess they're virtual for now, but they're being led by institutions all around the country and really all around the world, so check out gpuhackathons.org. And the GPU optimizations
A
You
know
that
we
talked
about
increasing
parallels
and
understanding
and
minimizing
the
the
data
movement
are
things
that
really
continue
on
the
themes
from
corey
and
are
really
the
most
important
things
that
you
need
to
still
think
about
when
optimizing
your
applications
for
for
promoter
and
then,
finally,
I
think
you're
going
to
hear
a
lot
about
a
lot
more
about
this
in
the
next
talk.
A
But
I
really
think
that
you
know
openmp
and
then
the
c
plus
plus
frameworks,
for
example,
are
becoming
viable
options
for
you
to
utilize,
not
just
pro
model
productively,
but
also
the
upcoming
exascale
systems
at
the
other
doe
doe
facilities.