From YouTube: Introduction to Perlmutter
Jay Srinivasan (NERSC)
Okay, so where does Perlmutter fit in the lineup of NERSC systems? Right between NERSC-8 and NERSC-10, so this is NERSC-9. As Richard mentioned, we've continued this transition of the applications and workflows, and we're starting to support new, complex workflows on the system, which is a mixed CPU and GPU system. We started deploying it in parts in early 2021, and we're now getting very close to rolling out its full capabilities.
At a very high level, this is what it looks like. There are CPU-only nodes, which use the AMD EPYC series; there are GPU-accelerated nodes with NVIDIA GPUs; we have an all-flash file system; and, from a user-environment point of view, we have workflow nodes and high-memory nodes as well.
There are also standard login nodes, all of which are hooked up to Slingshot, which is an Ethernet-compatible network.
In addition, the system pulls in other resources from NERSC: external file systems, high-bandwidth connections to archival storage and, of course, the other system that we have on the floor, which is Cori.
Here is a little more detail, and you can see some of the specifications of each of the pieces I just mentioned. I'll talk a little bit about the orchestration of how we set up the system, but right now this is our first system where the system-management portion is orchestrated using cloud technology.
The other thing Perlmutter has is that all of the compute nodes are connected to the NERSC network with a resilient, high-bandwidth link, and the little graphic on the bottom left shows what that looks like. Perlmutter has a multi-terabit-per-second connection to its edge routers, which in turn have a smaller, but still multi-terabit, connection to the NERSC network itself and then on to ESnet and the rest of the world.
One thing to note is that the system itself is divided largely into two parts: the management framework, which is the gray-colored rectangles here, and the compute node partition, if you will, which consists of the GPU and the CPU nodes. The compute nodes are all direct liquid cooled (water cooled, in this case), and the rest of the management framework is all air-cooled racks.
In terms of hardware capabilities, as I mentioned, there are the GPU-accelerated nodes, an example of which is shown right here; you can see the complexity of all of the cooling, the heat sinks and the various other connections on the node. They have four A100 GPUs per node.
Each of these GPUs has 40 gigabytes of high-bandwidth memory, which gives you 160 gigabytes of high-bandwidth memory on the node, and they're all linked together with NVLink 3, the latest generation of NVIDIA's link between the GPUs, as the picture here shows.
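As a quick sanity check on that arithmetic, here is a minimal sketch that tallies the per-GPU and per-node high-bandwidth memory. It assumes a Python environment with PyTorch and CUDA available on a GPU node, which is an assumption for illustration, not a statement about the system's default software stack:

```python
# Hedged sketch: tally the HBM of the GPUs visible on one node.
# Assumes PyTorch with CUDA support is installed; on a node like the one
# described above you would expect 4 devices at ~40 GB each (~160 GB total).
import torch

if torch.cuda.is_available():
    total_bytes = 0
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        total_bytes += props.total_memory
        print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.1f} GB HBM")
    print(f"Node total: {total_bytes / 1e9:.1f} GB HBM")
else:
    print("No CUDA devices visible on this node.")
```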
In addition, to help drive the node itself, they have an AMD EPYC 7763, also colloquially known as Milan. So, on top of the high-bandwidth memory, these nodes have 256 gigabytes of DRAM per node, which you can see here and which is also actually liquid cooled. And because they have four GPUs per node, we've provisioned them with four network cards per node, and in this case these are the latest generation of HPE's NICs.
These are what's called Slingshot 11, which are capable of 200 gigabits per second each. This is how they're connected: the GPUs are hooked up with PCIe through the CPU, which is in turn hooked up with PCIe to the NICs and then out to the outside world, and the GPUs are all linked to each other.
On the CPU nodes, which unfortunately I don't have a picture of, we have two sockets of Milan, and in this case we have 512 gigabytes of DRAM, because we have a little bit more space when we don't have the GPUs in there, and they have one Slingshot card per node. So if you look at this node, you'd basically double the number of CPUs and then remove all of the GPUs.
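To keep the two node types straight, here is a small, purely illustrative Python summary of the figures just quoted; the dictionary below simply restates the numbers from this talk and is not an official specification:

```python
# Illustrative summary of the node figures quoted above.
# Numbers are as stated in the talk; treat them as approximate.
nodes = {
    "gpu": {
        "cpu_sockets": 1,    # one AMD EPYC 7763 "Milan"
        "gpus": 4,           # NVIDIA A100, ~40 GB HBM each
        "hbm_gb": 4 * 40,    # 160 GB high-bandwidth memory
        "dram_gb": 256,
        "nics": 4,           # Slingshot 11, 200 Gb/s each
    },
    "cpu": {
        "cpu_sockets": 2,    # two Milan sockets
        "gpus": 0,
        "hbm_gb": 0,
        "dram_gb": 512,
        "nics": 1,
    },
}

for kind, spec in nodes.items():
    injection_gbps = spec["nics"] * 200
    print(f"{kind.upper()} node: {spec['dram_gb']} GB DRAM, "
          f"{spec['hbm_gb']} GB HBM, ~{injection_gbps} Gb/s injection bandwidth")
```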
Let's see. The other thing, as I mentioned, was the overall system orchestration framework, and here you have a high-level picture of what that looks like. You've got all of the non-compute nodes, and then you've got the compute nodes. Typically what happens is you have some small number of non-compute nodes that manage and boot all of the compute nodes and handle the storage and so on.
In this particular case, like I said, we're using an orchestration framework that is relatively new, to HPC at least, called Kubernetes, which is a service-oriented architecture, and that allows us to put various services on here in a more or less resilient manner. It controls all of the booting and orchestration of the other services on all of the non-compute nodes. The compute nodes are booted using this framework, but they're not actually controlled using Kubernetes; they're just booted. They run an enterprise Linux environment, which in this case is SUSE Linux that has been vendor-modified with certain drivers and things like that. It is bare-metal booted, so we're not controlling it using any virtualization: the environment that you get on the compute node is not virtualized in any way, but we have additional vendor-provided features in there, like the programming environment, various other Cray Linux features, and so on.
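To give a flavor of what that service-oriented management model looks like, here is a minimal sketch using the official Kubernetes Python client to list the pods backing a set of management services. The namespace name is hypothetical, and this is only an illustration of the Kubernetes model, not the system's actual configuration:

```python
# Hedged illustration of the Kubernetes service model: list the pods in a
# (hypothetical) "management-services" namespace and report their state.
# Requires the official "kubernetes" Python client and a valid kubeconfig.
from kubernetes import client, config

config.load_kube_config()          # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(namespace="management-services")
for pod in pods.items:
    print(f"{pod.metadata.name}: {pod.status.phase}")
```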
Let's see, in terms of differences from Cori: Richard pointed out that Cori, our previous system, is coming to the end of its life. At a very high level, both of these systems have a dragonfly topology, but obviously the network itself, the underlying network hardware and the protocols, are different. Cori has an Aries network, whereas Perlmutter has a Slingshot network. On Cori, though, you can see here that all of these additional capabilities, such as the login nodes, the GPU nodes that we added to Cori, and the storage nodes, are separate networks that connect through what are essentially gateways into the Aries network. You've got the KNL and Haswell nodes and so on, as well as the service nodes, all of which are part of the Aries network. On Perlmutter, everything is part of the one Slingshot network, indicated here by these two tiny switches, and then we have the CPU-only nodes as well as the GPU-accelerated nodes and the management network.
So you can see here a picture of Perlmutter's network, and the different colorings show the different groupings of these nodes. You have 24 of these compute groups, then 12 of these service groups with the I/O nodes, and then four for the service nodes.
In terms of software, we have a really rich set of programming environments that we support, as well as programming models and languages, all of which are pulled together by the community codes that we support as well. I won't go into too much detail since I'm running out of time, but that's what we have.
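As one generic illustration of the kind of parallel code those programming environments support, here is a minimal MPI "hello world" in Python. It assumes mpi4py is available in your environment, which is an assumption for illustration rather than a statement about the default modules:

```python
# Minimal MPI example, assuming mpi4py is installed in the environment.
# Launch with your site's parallel launcher across several ranks.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

print(f"Hello from rank {rank} of {size}")
```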
In terms of the science, this is just a teaser. I think Richard mentioned that our utilization has been very good.
If you look at the pie charts of usage of both the CPU nodes and the GPU nodes this year, you can see that the offices Richard talked about, in terms of where our user base comes from, are very widely distributed across the usage of the machine, both on the CPU side and on the GPU side. And here are some teasers for the kinds of science that we've supported on the machine, ranging from standard simulations to newer models in data and learning, as well as the cross-facility workflows that we're supporting in the superfacility effort, from DESI to LCLS and others.
For the future of Perlmutter, we're really looking forward to getting it into people's hands at its full capability, but we're not standing still. We're going to make a number of operational improvements, improvements to how you're able to access the system, as well as to the kinds of things you'll see when you do get on the system. We in part have continuous operations now, although we will continue to have maintenances and updates to the system, and we're hoping to do most of those non-disruptively for our users.
Not right now, but in the future, very soon, we're hoping to be able to give people access to dedicated containers when they log in. These will initially give you standard images, but we're also holding out the possibility of giving users the ability to customize their images in a few different ways; not a completely wild-west approach, but we hope to give users the ability to customize their images.
And then, as I mentioned, we have this really robust and resilient management infrastructure, where we'll give users the ability to run long-running services that can be managed using the Kubernetes framework, which allows for resilience and other services. From a user-access perspective, once you do get on the system, we're hoping to open up a much richer way of interacting with the services on the system, including things like RESTful interfaces to our workload manager and the ability to run against GitLab runners, which will help the CI/CD folks, as well as other data-movement operations.
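As a purely hypothetical sketch of what a RESTful interaction with a workload manager could look like, one might submit a batch script over HTTPS as below. The endpoint URL, token handling, and payload fields are invented for illustration and are not the actual API:

```python
# Hypothetical illustration of a REST-style job submission.
# The URL, authentication scheme, and payload fields are made up for this
# sketch; consult the facility's documentation for the real interface.
import requests

API_URL = "https://api.example.org/compute/jobs"   # placeholder endpoint
TOKEN = "..."                                       # placeholder credential

job_script = """#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:10:00
srun hostname
"""

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"script": job_script},
    timeout=30,
)
resp.raise_for_status()
print("Submitted:", resp.json())
```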
But let me just stop there.