From YouTube: 02 Intro to Perlmutter and GPUs
Description
Part of the Migrating from Cori to Perlmutter Training, December 1, 2022.
Please see https://www.nersc.gov/users/training/events/migrating-from-cori-to-perlmutter-training-dec2022/ for the training day agenda and presentation slides.
Hi everybody. I am the application performance lead at NERSC, and I'm going to give you an introduction to the Perlmutter system, a little bit about programming GPUs, and the different programming models, languages, libraries, and frameworks that are available on Perlmutter. I think we'll close with just a few early science stories, to maybe motivate the audience with what can be, and has already been, done on the system.
I think, as Helen pointed out, if you have questions, put them in the Google Doc. I think we're doing it that way because then the questions stay around after the Zoom meeting is over. Okay, so let me put Perlmutter in perspective.
This is the NERSC system roadmap, and one of the things I want to highlight here is that we're in a transition away from what was a pretty typical HPC system a decade ago: Edison was a multi-core, CPU-based system replicated across thousands of nodes to make an HPC architecture. We've started moving towards energy-efficient, exascale-like architectures in order to meet the demands of the community.

We started that transition with Cori, which was powered by Intel Xeon Phi processors, and then with Perlmutter we have our first ever GPU-accelerated system. We're already beginning the process of procuring the NERSC-10 system, which I think is expected in the 2025-2026 time frame. There's not much we can say about that yet, but we're expecting this trend towards energy-efficient architectures to continue.
If you look at Perlmutter, we have two types of nodes. First, the NVIDIA Ampere GPU-powered nodes, each of which has four GPUs and one CPU per node, and they have a tremendous amount of performance in the node: you can see here over 75, I guess that says teraflops, but I think that's a low number.

And then we have the AMD Milan CPU nodes, which don't have GPUs but do have two CPUs and 256 gigabytes of DDR, so a little bit more DDR per node, but you don't have the high-bandwidth memory that comes with the GPUs. The system as a whole has about 1,500 GPU nodes.
Again, those are nodes with one CPU, the AMD Milan CPU, and four NVIDIA A100 GPUs, and then it has about 3,000 CPU-only nodes. It may seem like there are a lot more CPU nodes than GPU nodes, but keep in mind that most of the actual performance of the system, most of the total available flops, comes from those GPU-powered nodes.

You can see that here in terms of the performance. If you look at the performance of the GPU nodes, you have about 120 total petaflops (I think that's the right number) if you include the capability of the tensor cores within the GPUs, compared to close to, I guess, eight petaflops for the CPU nodes. This top row here, where my mouse is, shows the performance of the CPUs that are within the GPU nodes; so if you ignore the GPUs, you have about another four petaflops of performance.
It's all downstairs in the building that I'm talking to you from, and you can find even more details about the architecture and the different components at the URL here. I'm going to talk a little bit more about the two types of nodes. As I mentioned, most of the performance on the system comes from the GPU nodes, and I mentioned that you have one AMD Milan processor, pictured here, with four NVIDIA A100 GPUs; those are the four Ampere GPUs pictured here.
GPU memory is typically on the order of maybe 16 gigabytes per GPU, and one of the important things about the GPUs is that that memory comes with very, very high bandwidth, so you're able to achieve close to 1,600 gigabytes per second of GPU memory bandwidth. Importantly, as I'm going to discuss in a minute, that bandwidth is much higher than what you get by moving data across the PCI Express bus between the CPU and the GPUs.
The CPU node here looks a little bit simpler: you have two AMD Milan processors. Each one of those is a 64-core part. They support AVX2 instructions, similar to, but not quite as wide a vector width as, what you had on the Intel Xeon Phi processor for Cori, and they do have relatively high memory bandwidth for a CPU. You can see that you have 204 gigabytes per second of memory bandwidth, but of course that's significantly lower than the memory bandwidth that you saw on the GPUs.
The scratch file system can support an aggregate bandwidth of up to five terabytes per second and five million IOPS, and here are some of the other characteristics of the parallel file system, including the number of metadata servers and I/O servers. Unlike on Cori, where you had a spinning-disk-based scratch file system combined with a burst buffer, everything on Perlmutter scratch is flash, which we think is a nice usability improvement.
One of the things I want to highlight here is that we, together with you all and the community, have this common challenge, which is to enable all of these different science users and codes to run efficiently on these advanced architectures, including Cori, now Perlmutter, and eventually the NERSC-10 system. Here is a side-by-side comparison of what that challenge looks like in this generation, moving from the Cori system to Perlmutter.
We're moving up significantly in the total capability of the system, with significantly more memory. One of the really big differences is that the performance per node has gone up dramatically on Perlmutter compared to Cori. These nodes are much, much more dense in terms of their compute power: you go from about three teraflops to 70 teraflops per node, and the processors now include the accelerators.

The number of nodes has actually shrunk, I think, which is consistent with the nodes themselves being more powerful. The other thing I would highlight is the all-flash file system, which I think is actually one of the things that makes running on Perlmutter perhaps even a little bit easier than it was on Cori.
When we were thinking about procuring a GPU system, we started by looking at our workload and asking ourselves what fraction of it could really take advantage of the GPUs. This is where things stood in the 2017-2018 time frame. The good news is that, because GPUs had been around at a number of different facilities and had been used in different places throughout the world, a number of codes had GPU versions available already. But a number of applications were also only partly ported, or in some cases hadn't even started, and so we've been working with a number of these code teams over the last five years or so to make sure that they would be ready for Perlmutter.
Some of what I want to tell you about today is what we've learned from that process, and what some of those lessons are that you can learn from as well. In general, as you're thinking about a CPU-to-GPU transition for an application, one way to think about it is through the analogy I'm showing on this slide. A CPU is something like a Ferrari (I think this is supposed to be a Ferrari in this picture): a car that can go really fast and make really tight turns, but that's really good at doing one task at a time, or taking one person to one place at a time. A GPU is something like this double-decker bus, which is good at taking a lot of different people to one place, not as fast as the CPU would take an individual, but with a higher overall throughput. This is evident if you compare the amount of parallelism that is available in a CPU from our Cori system (this is the Cori Haswell partition) to a GPU on Perlmutter.
If you think about the two sockets of one of the Cori Haswell nodes, you have 64 compute cores, each of which can support up to two threads via Intel's hyper-threading technology, and if you use those AVX2 instructions, you can compute on two 256-bit vectors at a time. That all adds up to about 2,000-way parallelism that a Haswell CPU node is capable of. If you compare that to a single A100 GPU, the equivalent of the 64 cores is basically the 108 SMs, or what they call streaming multiprocessors, on the GPU. Each one of those can support up to 64 warps per SM; only two can be active at a time, but you generally want to oversubscribe the number of warps per SM to keep the GPU busy. And then within each one of those warps you have 32 threads.
So if we do the math here, that adds up, or multiplies up I guess, to about 200,000-way parallelism, which is of course two orders of magnitude bigger than what you see on the CPU node. And this is what I said verbally: you typically want to oversubscribe the GPUs in order to keep them busy and hide any latency.
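As a rough sanity check of those figures (my own arithmetic, assuming single-precision lanes in the AVX2 vectors and the usual 32 threads per warp):

$$\underbrace{64 \text{ cores} \times 2 \text{ threads} \times 2 \times \tfrac{256}{32} \text{ lanes}}_{\text{Haswell node}} \approx 2{,}048\text{-way}, \qquad \underbrace{108 \text{ SMs} \times 64 \text{ warps} \times 32 \text{ threads}}_{\text{A100 GPU}} = 221{,}184\text{-way}.$$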
Another big difference between the CPU and the GPU is the memory bandwidth. If you look at the Haswell CPU node, we had 128 gigabytes of DDR per node and approximately 128 gigabytes per second of memory bandwidth. Now, if you compare that to a single A100 GPU, we have 40 gigabytes of high-bandwidth memory, but significantly higher memory bandwidth, about 1,500 gigabytes per second, so an increase of an order of magnitude.

Again, as I was highlighting when I was talking about the architecture, this should be compared against the speed of moving data between the CPU and GPU across the PCI Express bus, which is about 32 gigabytes per second. So you can move data within the GPU very, very fast, but moving data between the CPU and the GPU can be very slow. The lesson learned here is to try to avoid moving data back and forth frequently.
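To make that lesson concrete, here is a minimal sketch of my own (not from the slides), written with OpenMP target directives, though the same idea applies in CUDA or OpenACC: keep the arrays resident on the GPU across the whole time loop and cross the PCI Express bus only at the beginning and the end. The array names, sizes, and update are hypothetical.

```cpp
#include <cstdio>
#include <vector>

int main() {
  const int n = 1 << 24;
  const int nsteps = 100;
  const double dt = 0.01;
  std::vector<double> u(n, 1.0), v(n, 2.0);
  double *up = u.data(), *vp = v.data();

  // Copy both arrays across PCIe once, up front, and leave them on the GPU.
  #pragma omp target enter data map(to: up[0:n], vp[0:n])

  for (int step = 0; step < nsteps; ++step) {
    // Each launch reads and writes GPU-resident memory at HBM speed;
    // no host/device traffic inside the time loop.
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; ++i)
      up[i] += dt * vp[i];
  }

  // Bring only the result back, once, at the end.
  #pragma omp target exit data map(from: up[0:n]) map(delete: vp[0:n])

  std::printf("u[0] = %f\n", up[0]);
  return 0;
}
```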
In general, the challenge of optimizing an application for the GPU, or bringing an application to the GPU, is that there can be multiple GPU optimization avenues. The two themes I just highlighted are that you need orders of magnitude more parallelism, and that you need to recognize that, while the GPU memory is very fast, moving data back and forth is very slow. Then there are of course other, second-order considerations. Just two examples: there is some overhead in launching kernels (the bits of code that run on the GPU), so you want to make sure that you're giving the GPU enough contiguous work to work on; and even though the memory on the GPU is fast, you still want to take advantage of caches, registers, and shared memory as much as possible.
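As a hedged illustration of the kernel-launch-overhead point (again a sketch of my own, with hypothetical array names): launching one tiny kernel per outer-loop iteration pays the launch cost over and over, whereas collapsing the loops gives the GPU one large launch with enough work to stay busy.

```cpp
#include <vector>

// Poor granularity: one small kernel launch per row, so the per-launch
// overhead is paid nrows times and each launch has little work to do.
void scale_rows(double* a, int nrows, int ncols) {
  for (int i = 0; i < nrows; ++i) {
    #pragma omp target teams distribute parallel for map(tofrom: a[i*ncols:ncols])
    for (int j = 0; j < ncols; ++j)
      a[i * ncols + j] *= 2.0;
  }
}

// Better: a single launch over the whole iteration space via collapse(2),
// giving the GPU enough parallel work to hide latency.
void scale_all(double* a, int nrows, int ncols) {
  #pragma omp target teams distribute parallel for collapse(2) \
      map(tofrom: a[0:nrows*ncols])
  for (int i = 0; i < nrows; ++i)
    for (int j = 0; j < ncols; ++j)
      a[i * ncols + j] *= 2.0;
}

int main() {
  const int nrows = 1 << 10, ncols = 1 << 10;
  std::vector<double> a(static_cast<size_t>(nrows) * ncols, 1.0);
  scale_rows(a.data(), nrows, ncols);
  scale_all(a.data(), nrows, ncols);
  return 0;
}
```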
So we've realized that this can be a multi-axis problem: there are multiple axes along which you need to understand your performance. One of the things we've been working on with our vendor partners is putting together tools that help you quickly profile your application and understand what's limiting your performance. We've worked with NVIDIA in particular on their Nsight profiling tool, and it now integrates what we call the roofline performance model, which plots your application's performance on a two-dimensional plot. That plot considers the characteristics of your algorithm in terms of the data movement versus the compute required, and shows you where you stand against what we would expect for an algorithm with those characteristics. From there you can devise a strategy for improving your overall performance. This is something that's baked into the tools that you can use here at NERSC now.
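For reference, this is the standard form of the roofline bound (not specific to the slides): attainable performance is limited by either the peak compute rate or the product of a kernel's arithmetic intensity $I$ and the memory bandwidth,

$$P_{\text{attainable}} = \min\left(P_{\text{peak}},\ I \times B_{\text{mem}}\right), \qquad I = \frac{\text{floating-point operations}}{\text{bytes moved to and from memory}}.$$

Kernels that sit well below that bound, or far to the left of the ridge point $P_{\text{peak}}/B_{\text{mem}}$, tell you whether to work on data movement or on compute next.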
Our strategy over the last several years, as part of a program called NESAP, which stands for the NERSC Exascale Science Applications Program, has been to partner with a number of application teams to prepare for Perlmutter at a pretty deep level, and then to take what we've learned there and share it with everybody in the community.
This, I just wanted to highlight, was an all-hands-on-deck activity. A number of the people who you see here will be available in the afternoon to work with you and to talk with you about your own applications. And one of the things I really want to highlight here is that these hackathons have been really effective in helping improve various code teams around the country.
We've had two types. The first type is now wrapping up; it was part of the NERSC project itself. The second type is these public GPU hackathons that anybody and everybody can apply to join as a team at this URL. We've actually provided more mentors to these hackathons than, I think, any other institution in the world, and so we'd love to work with you at hackathons all around the country and, to a certain extent, all around the world.
These are organized by NVIDIA, Oak Ridge National Lab, and us here at NERSC, and I think there's probably, on average, about one a month within North America.
So really, one of my take-home messages today is: go check out this URL, and if you think you really want to deep-dive into your application and optimize it for GPUs, I think these can be really good events. Just as an example of how this worked with one of these applications, you can see their performance here over time as they were working on the application and optimizing it for GPUs. One of the things I want to highlight is that the speed-ups they obtained are centered around these different hackathon events: they were able to make a significant difference in just a few days by attending a hackathon, continuing on those improvements, and then attending another hackathon and making significantly more improvements. This actually led the team to do some really large-scale science runs on both Perlmutter and other available GPU systems,
runs that I don't think really could have been done without this new system and the work that the team put into the project. One of the other things I wanted to highlight is that we've been working with teams to do really large-scale science runs on Perlmutter and related GPU systems over the last several years, and one of the outcomes is these really large-scale, state-of-the-art science calculations that are recognized each year at the Supercomputing conference as Gordon Bell Prize finalists or, in some cases, winners.
So, just some observations about this process. I think many applications have been successful in preparing for Perlmutter, and we'd really like to keep engaging with you all in the community to enable you to use the system productively. We really do encourage everyone to join these community GPU hackathons at gpuhackathons.org; I think that's just a great way to get a lot done quickly. We do recognize that optimizing your application for GPUs is not a linear, one-size-fits-all process: there are multiple optimization angles, and profiling, using the roofline tool that I highlighted, is a great way to get started.
The other thing I would note here at the end is that I really think we've seen a lot of energy coming from you all, the community, who are really excited about the potential of Perlmutter. I think that's really great to see, and it's something that we are really excited about as well. So I want to now change gears; I think I'm moving into the second part of this talk, and I'll try not to go over my time too much here. I just want to talk about some of the programming environment that is available on Perlmutter for everybody to use. One of the things I want to highlight about Perlmutter is that, compared to some of the other GPU systems out there that use GPUs from vendors that are maybe not quite as mature as the NVIDIA parts, we have support for essentially every GPU programming model out there on Perlmutter.
So we support Fortran, and we recognize that some applications are written in CUDA Fortran; you can use those on Perlmutter. We realize that a lot of applications out there are written in CUDA, and you can use those on Perlmutter, and also OpenACC; for example, VASP is an important application that's written in OpenACC. We also realize that people have invested a lot in OpenMP in their applications for Cori, and one of the things that we are happy to say about Perlmutter is that you can transition those OpenMP codes and target the GPUs with the new OpenMP 5.x standard. And then for more modern C and C++ applications, you can use frameworks like Kokkos, RAJA, and even DPC++ and SYCL to run on the system as well.
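As one hedged illustration of what such a framework looks like in practice (a minimal Kokkos sketch of my own, not taken from the slides), the same C++ source can target the A100s through Kokkos' CUDA backend or run on the CPU nodes through a host backend:

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 20;
    // Views allocate in the default execution space's memory: GPU memory
    // when built with the CUDA backend, host memory otherwise.
    Kokkos::View<double*> x("x", n), y("y", n);

    Kokkos::parallel_for("init", n, KOKKOS_LAMBDA(const int i) {
      x(i) = 1.0;
      y(i) = 2.0;
    });

    // axpy-style update written once, portable across backends.
    Kokkos::parallel_for("axpy", n, KOKKOS_LAMBDA(const int i) {
      y(i) = 3.0 * x(i) + y(i);
    });

    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}
```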
We also have a pretty robust programming environment around data and analytics. Helen highlighted using Jupyter to log on to the system, and of course Python and the NVIDIA RAPIDS stack are supported. PyTorch and TensorFlow, two of the more important machine learning frameworks, are also really well optimized for the system. We have a set of debuggers and profilers that I highlight here in particular, including the Nsight profiler that I talked about earlier in regard to roofline performance modeling.
We have a really growing segment of our user base that is using Python, and so, through collaboration with NVIDIA and HPE, we've been working to make sure that performant Python acceleration with the GPUs is available on the system. Here are some of the libraries that you can use to do that, including PyTorch and TensorFlow again for AI and machine learning applications. I might just skip this slide, because I think Helen mentioned it in her deck, but really quickly: you can definitely run Jupyter notebooks on Perlmutter in a whole bunch of different configurations, including on a shared CPU node or with exclusive GPU access as well.
So, as I said, we're trying to take a pragmatic approach. We recognize that there are a lot of users out there who have existing GPU codes, and we want to meet them where they are and allow them to run those codes performantly on the system. At the same time, we also want to promote performance-portable programming models that we think will give your code a little bit more longevity, longer legs going forward. Those include OpenMP 4.5 and 5.x support, as well as things like Kokkos, which are big investments within the DOE to allow C++ applications to run on CPUs and GPUs from multiple vendors, now and into the future.
So our strategy has really been, as I said, to support all of those major programming models and languages; to pre-install optimized versions of many of your favorite applications (particularly in the materials science and chemistry space there are a lot of shared application codes, like VASP and LAMMPS, which we talked about, and Quantum ESPRESSO); and to work with the vendors to make the process of understanding your performance on GPUs a little bit more tractable. And I just want to highlight again how useful these GPU hackathons can be.
If you go to gpuhackathons.org, again, you can register for an upcoming event. We in particular have invested some of our own time and resources into enabling a couple of performance-portable programming models. The one I really want to highlight here is OpenMP 4.5 and 5.x, which has gained accelerator support in the NVIDIA HPC compiler stack because of a NERSC collaboration with NVIDIA.
What we did was settle on a well-defined subset of the OpenMP standard for optimized GPU acceleration, and this has now been released in production in the NVIDIA compiler stack that is available on Perlmutter; you can basically use it today.
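To give a flavor of what that looks like in practice, here is a minimal, hedged sketch of an OpenMP target-offload loop of my own (not from the slides). On Perlmutter it could be built with the NVIDIA compilers' GPU-offload flag, for example `nvc++ -mp=gpu daxpy.cpp`, though check the current NERSC documentation for the recommended modules and flags.

```cpp
#include <cstdio>
#include <vector>

int main() {
  const int n = 1 << 24;
  const double a = 3.0;
  std::vector<double> x(n, 1.0), y(n, 2.0);
  double *xp = x.data(), *yp = y.data();

  // One daxpy-style kernel offloaded to the GPU; the map clauses move the
  // arrays across PCIe before and after the launch.
  #pragma omp target teams distribute parallel for \
      map(to: xp[0:n]) map(tofrom: yp[0:n])
  for (int i = 0; i < n; ++i)
    yp[i] = a * xp[i] + yp[i];

  std::printf("y[0] = %f\n", yp[0]);
  return 0;
}
```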
And so I will close here with just maybe a couple of science examples, and I know that I'm just about out of time, so I'll maybe just do one or two. This is actually an application that's near and dear to my heart.
This is a materials science application where the goal is to help design potential future qubits for a quantum computer in complex materials that have some sort of defect in them. In this case the defect is what's called a divacancy, so essentially two atoms are removed from the crystal, and the goal is to understand the quantum states around that defect. To do that, this team needed to simulate unprecedented system sizes, with thousands of atoms, and here you can see some of the results.
You can see the performance improvements as the GPUs evolved into the final GPU part for Perlmutter, and you can see the scaling to essentially a full GPU system like Perlmutter in the plot above. Another example is ExaBiome. This is a metagenomics code where they basically take and analyze the genome of an entire population, which could come from, say, a chunk of dirt or the inside of an animal's gut, where there's a thriving ecosystem of different types of viruses and bacteria. They want to sequence that entire population, and so they have a lot of challenging operations to separate, analyze, and assemble that genome.
And you can see that, even with a set of work that isn't immediately amenable to GPUs, they've been able to make significant improvements. What you're seeing here is a comparison of the CPU versus the GPU performance on Perlmutter for their particular algorithms. I think, let's see, I'm just going to highlight maybe one more here, which is some of the early successes we've had working with different facilities, in particular some of the light sources and the astrophysics or observational facilities.
This includes the LCLS at Stanford and NCEM here at Berkeley Lab, and the Dark Energy Spectroscopic Instrument (DESI) and the LZ projects as well. All of these teams are up and running on Perlmutter and have stories that you can read about on our website.