From YouTube: Introduction to Perlmutter System
Description
Jay Srinivasan (NERSC)
Introduction to Perlmutter System
Thank you, Jack, and thank you to Neil and Helen and all the other organizers for putting this together. It's very interesting to see the progression of this. We sort of put together the idea for Perlmutter back in 2015, and the landscape of GPUs at NERSC was very different then, and the overall push to get our user base ready for GPUs has been really spectacularly successful. So let me give a really short overview — none of this will be surprising to you all, but just to set the tone here.
Okay, and you can all see the slides.
Obviously, it's very fitting to talk about Perlmutter at GPUs for Science, because it does actually provide GPUs for science, and so in the NERSC pantheon of systems Perlmutter is right there in the middle, at least on this slide.
So, starting in 2021, we started deploying parts of Perlmutter, and we actually made the GPU-accelerated nodes the very first thing that was available — to our staff, then to some other users, and then to all users. And typically, as you can see from the progression here, we get a new system every few years.
NERSC-9, Perlmutter, is in some ways new and in some ways a continuation of what we initiated with Cori, where there were many-core CPUs; the parallelism and advanced architectures that were kicked off in that era continued on with Perlmutter.
So at a very high level, like Jack said, it is a system that's been optimized for science. By that we mean we don't just have one kind of thing on the system; we have in fact many different things, and we make sure that they work well together. We have CPU-only nodes, and we have GPU-accelerated nodes, which I'm sure are very interesting to all of you. We have a high-performance, all-flash file system. And then, from the user's point of view, the first things the user sees on the system — or even just getting onto the system — are things like the login nodes, or nodes that aren't necessarily optimized for high-performance compute but provide the basic foundational building blocks that allow you to compose your workflows.
In addition, to bring together the NERSC environment, we make sure that external file systems and networks can connect well and have a good path into the system. Here is a little bit more detail on this, which I hope should be familiar to people who've actually been on the system now.
We can broadly divide the system into three parts. One is all of these gray boxes, which are the supporting environment.
So: the login and access nodes and the service nodes. The way we manage the system is with a Kubernetes-based system-management orchestration, and that isn't directly visible to the users, but it does have some features that make the user environment, and access to the system, much more useful and interesting. Then we have the compute portion, which right now, like I said, is composed of GPU-accelerated nodes and CPU-only nodes. And then all of this is tied together with these Slingshot switches.
Slingshot is the high-bandwidth, low-latency network that HPE has put together for this system, and that is in fact what allows all of Perlmutter — and I'll show you the differences from Cori — to sit under one network. In addition, there's not a lot of point in having a system that nobody can get to and do interesting things on.
So we have a very resilient, high-bandwidth link to the NERSC network and to the world. The picture here on the lower left shows that we have Perlmutter, all of it on the Slingshot network, and then a very high-bandwidth, resilient connection out to the edge routers — this is multiple terabits per second.
And then from the edge router we connect to all of the other stuff that's outside of Perlmutter. Let's see — in terms of the specific hardware, obviously, on the GPU side, the most interesting ones are the GPU-accelerated nodes, and these have four NVIDIA A100 GPUs per node. You can actually see them over here in this little picture, which shows the actual motherboard.
In addition, to actually drive the node — since the GPUs are not yet able to do that themselves — we have an AMD Milan CPU. We also have DRAM on the GPU nodes: 256 gigabytes of it, which you can actually see here between these copper fins. And then, because we have four GPUs, we have four Slingshot NICs per node.
On the CPU-only nodes we have just the two CPUs, and because we then have a lot more real estate on the motherboard, we're able to give you 512 gigabytes of DRAM per CPU node, and just one Slingshot NIC.
The GPUs themselves are hooked up together like this, and they're connected to the rest of the node as shown in the picture on the top left: they have PCIe connections to the CPU and to the NIC, and from there to the outside world, and amongst themselves they're connected with NVLink. (A minimal sketch of how a job can line up with that four-GPU layout follows below.)
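As an illustration of the four-GPU-per-node layout just described — this sketch is not from the talk — the following Python snippet binds one MPI rank to each A100 using the node-local task index. It assumes mpi4py and CuPy are available in the job's environment and that Slurm launches four tasks per node; those are assumptions, not documented Perlmutter defaults.

# Minimal sketch (assumption, not from the talk): bind one MPI rank to each
# of the four A100 GPUs on a GPU-accelerated node.
import os

from mpi4py import MPI   # assumes an MPI-enabled Python environment
import cupy as cp        # assumes CuPy is installed for GPU work

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Slurm sets SLURM_LOCALID to the task's index on its node
# (0..3 if four tasks per node are launched, matching the four GPUs).
local_rank = int(os.environ.get("SLURM_LOCALID", 0))

n_gpus = cp.cuda.runtime.getDeviceCount()   # expected to be 4 on a GPU node
cp.cuda.Device(local_rank % n_gpus).use()   # pin this rank to "its" GPU

# Trivial on-GPU work so each rank exercises its own device.
x = cp.arange(1_000_000, dtype=cp.float32)
local_sum = float(x.sum())
total = comm.allreduce(local_sum, op=MPI.SUM)

if rank == 0:
    print(f"ranks={comm.Get_size()}  gpus_per_node={n_gpus}  total={total}")

A job might launch this with something like "srun -n 4 python gpu_bind.py" so that the four ranks on a node line up with the four GPUs; the exact launch flags and module setup depend on the site configuration and are not taken from the talk.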
So, from a system point of view, let me just take a minute here and talk about the features. Shasta is the system-management framework — the software stack — that we have on this system, and this is a fairly typical picture, but there are some differences from standard large clusters or large supercomputers. We have a whole bunch of non-compute nodes on the left, and then we have all the compute nodes on the right. The non-compute nodes in this case are in fact managed with cloud-managed infrastructure.
This is very similar to what the very large cloud providers use, and there are certain benefits to doing it at this scale — the scale we have — as well. It gives you the service-oriented architecture that allows us to use a lot of the new developments in system-management capabilities that are out there to manage the system. The compute nodes themselves are essentially bare metal, so there aren't any directly user-accessible, cloud-oriented features on them.
But there is a bunch of value that we can leverage from this cloud approach. We have a system-wide API that users can get access to, to help control some aspects of their workflows, and all of the services themselves are resilient when we put them on this framework.
So, at a very high level, the user-environment nodes — the worker nodes and login nodes — are all controlled using Kubernetes; the compute nodes themselves are not directly controlled by Kubernetes, but the entire orchestration process is controlled using it.
In terms of differences from Cori — I think people are familiar with Cori — again, the main thing is that all of Perlmutter, which is this middle block here on the left of this slide, is under one Slingshot network. Compare that to Cori, where you had the two partitions, the Haswell and the KNL partitions; those, along with the associated service nodes, the boot nodes, and the nodes that help you get
access to the file system, are all part of the Aries network, but the access nodes, as well as the storage itself, are on a different network. So there's a network translation happening at some level on Cori between, say, the KNLs and the I/O nodes, or the KNLs and the paths to the outside world, whereas with Slingshot everything is on one network. You can see that in this picture here on the far right, which shows the network connections between what we call groups. Roughly speaking, each one of these little dots is a cabinet — on the compute side, to a first approximation — plus the rest of the cabinets, the service cabinets, and so on, and you can see the network connections that go between them, as well as the connections to the outside world. The sort of deceptively small connection here is, in fact, our high-bandwidth connection to the rest of NERSC.
In terms of the software, I think you'll hear more about this in the talks you'll be getting, but we have a very rich and robust programming environment, as well as support for various programming models and languages, and I won't go into great detail here.
It's just to note that there's very good coverage of things that are formally supported by the vendor, but also NERSC-supported, in the sense that there is staff — the teams that Jack and other groups, Rebecca's group and others, are all part of — strongly supporting all of these programming environments that we have on the system. That will help make the system very productive for science.
The pie charts on the left here show you the breadth of the user base — basically who is using the system, broken down by which DOE Office of Science office they get their support from — and you can see we have very good, broad coverage across both the GPU and the CPU node usage on Perlmutter over the last year, and that also translates to the kinds of science happening on the system.
To back up a little bit: we've had these three pillars that we've tried to support on Perlmutter — traditional simulation, data, and machine learning — and we've had successes in all three of these pillars even at the beginning of this rollout of Perlmutter; you can see some of the examples here on this slide. And then, finally, in terms of the future of Perlmutter — I'm almost out of time.
Perlmutter is just being introduced, so it will have a long and storied life ahead of it, but we're not just going to keep the system static. There are a whole bunch of improvements that the systems folks, as well as the user-integration folks, are making to this system to help it get better every single year, and we can bucket those into three large buckets.
One is operational improvements that help us keep the system up and running with the least amount of impact, and hopefully give you the most productivity. In terms of the user environment, we're going to be able to start giving access to container-based environments that allow users a lot more control over what they see when they log into the system and the kinds of things they can immediately do, which is a big boost for productivity. And then, on the access side,
we're going to have a whole bunch of API-driven interactions that we're going to be able to enable very soon, including things like new ways of interacting with the workflow manager through a set of RESTful interfaces, management of automated tools, GitLab runners for CI/CD, as well as data-movement operations; a sketch of what that kind of interaction could look like is below.
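As an illustration of the kind of API-driven interaction being described — not the actual NERSC or Slurm REST API — here is a hedged Python sketch that submits a batch script over HTTPS. The base URL, route, token handling, and payload shape are placeholders I've assumed for the example.

# Hypothetical sketch: submitting a batch job through a RESTful interface
# instead of an interactive login. Endpoint and payload are illustrative
# assumptions, not a documented API.
import requests

API_BASE = "https://api.example.invalid"   # placeholder base URL (assumption)
TOKEN = "<access-token>"                   # obtained out of band; placeholder

headers = {"Authorization": f"Bearer {TOKEN}"}

# A small batch script sent as the job payload; the SBATCH options are
# ordinary Slurm directives used here purely for illustration.
job_script = """#!/bin/bash
#SBATCH -N 1
#SBATCH -C gpu
#SBATCH -t 00:10:00
srun -n 4 ./my_gpu_app
"""

resp = requests.post(
    f"{API_BASE}/compute/jobs",            # illustrative route (assumption)
    headers=headers,
    json={"script": job_script},
    timeout=30,
)
resp.raise_for_status()
print("submitted:", resp.json())

The real interfaces (for example, Slurm's REST API or NERSC's own service API) differ in their routes, authentication, and payloads, so this is only meant to convey the shape of the interaction.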
So let me stop there. We're really thankful to the community of users here, who have helped the system to be very productive, and we look forward to giving you all Perlmutter to enable great science. Thank you very much. Let me just stop sharing.