From YouTube: 1 Intro to KNL on Cori (NERSC Cori KNL Training 6/2017)
Description
From the Cori KNL Training held June 9, 2017. For slides please see http://www.nersc.gov/users/training/events/cori-knl-training-2/
A
Okay, so thank you all for coming. Welcome to the latest Knights Landing new user training. This is the last KNL training that we will do before the KNL nodes go live and before we start charging on Cori, so hopefully by then you all will be familiar with the architecture, the hardware and software, and how to get your codes running well.
A
So for those of you following along at home, the KNL nodes were installed in Cori towards the later end of last year. They've been in sort of a pre-production phase since then, so not quite a year. That phase consisted of no charge for users who were enabled for those nodes and relatively limited access until recently; now all of NERSC's users should be able to use the KNL nodes.
A
If memory serves, I believe it's 96 NERSC hours per KNL node hour, which is a little bit more than the Haswell nodes, and hopefully there will be fewer downtimes, a more stable software environment, and everyone will be running well. So KNL — formally it's called Xeon Phi; KNL is a code name that Intel used until they released it — is the second generation of the Xeon Phi architecture from Intel. Unlike Knights Corner, which some of you may have used, it's a self-hosted architecture, so the operating system runs directly on the Knights Landing hardware.
A
There are significant improvements in both scalar and vector performance over Knights Corner, and there's this new on-package high-bandwidth memory. On this last point, the on-package fabric is actually not relevant to Cori — we don't use Intel's fabric, we use Cray's — so you can ignore that last point. Intel does release three versions of Knights Landing; we are the one on the far left, KNL self-boot. The other two are part of other systems at other labs, but that's not what we have.
A
You can see the MCDRAM is the on-package high-bandwidth memory; there are, I believe, eight channels of MCDRAM and six channels of DDR4. This diagram says 36 tiles, which is 72 cores at two cores per tile, but the part that we have — sorry, we have 68 cores, 34 tiles — has significantly higher peak double-precision and single-precision floating-point performance than both KNC and the Haswell nodes.
A
This slide claims three teraflops; in reality you'll probably see a little bit less than that. And unlike Haswell, which is a two-socket system — at least the version we have on Cori — the KNL nodes are one socket each, and there are 16 gigabytes of this high-bandwidth memory, which can reach somewhere around 400 gigabytes per second of bandwidth, compared to about a hundred for the DDR4.
A
But an important thing to keep in mind is that Intel gives, and rarely takes away, when it comes to the instruction set. So anything that was compiled several generations prior to Knights Landing will still run on Knights Landing without recompiling. So, for example, you would not have to recompile Emacs or vim to run on Knights Landing, which is nice.
A
So this is a comparison of the Knights Landing nodes on Cori versus Edison, which uses the Ivy Bridge Xeon architecture. Edison is the older of the two Crays that we have in production right now. So this just gives you an idea, at a very high level, of how things have changed and of the things that users will need to be aware of when they're migrating from Edison to Cori, and particularly to the Knights Landing nodes on Cori.
A
A KNL node has a single socket with sixty-eight cores. Edison — the Ivy Bridge nodes — also supports two hyper-threads per core; Knights Landing supports four. So the number of hyper-threads that are active per node has gone up from 24 times 2, which is 48, up to 68 times 4, which is 272, so the total number of hyper-threads that you can keep running has gone up a lot. So that's all good; larger numbers are usually better.
A
However, there are important things to keep in mind, such as the clock speed, which has gone down by about a factor of two. Edison sort of fluctuates around two and a half gigahertz, Haswell is about 2.3 gigahertz, and Knights Landing is about half of that, between about 1.2 and 1.4, so the clock speed has gone down quite a bit. That means that in order to get performance we have to find that performance elsewhere, and the way we find it is usually through more parallelism.
A
So Edison only had one type of memory, which I believe is DDR3 — 64 gigabytes per node, which comes out to about two and a half gigabytes per core. Cori has two types of memory, so the memory hierarchy is a little bit more complicated now: it has 16 gigabytes of this on-package MCDRAM with about 400 gigabytes per second of bandwidth, and then 96 gigabytes of DDR4, which is 112 gigabytes total.
A
So if you're running in cache mode — which is a memory mode that you can run in, which we'll talk about in a little bit — that's 112 gigabytes of memory total per node that you have available, which comes out to something like 1.6 gigabytes per core. If you're running in flat mode and you just want to use the high-bandwidth memory exclusively, it's much lower: about 230 megabytes per core. So the memory per core, no matter how you look at it, has gone down quite a bit in comparison to Edison.
A
Again, in terms of the instructions that you can issue on a KNL node, you can see the two columns on the left show the Sandy Bridge and Haswell instructions. Sandy Bridge, I believe — yes, Sandy Bridge supports up to AVX, Haswell supports 256-bit vectors via AVX2, and then KNL supports AVX-512, which is 512-bit, so the vector widths of the instructions are getting wider.
A
Intel has not taken anything away in terms of instructions, so anything that you compiled for Haswell, or even for Sandy Bridge or Ivy Bridge, will still run on KNL. It may not run well — it may not run as well as it can, because it won't have these AVX-512 instructions — but it will run. So that's fine for some applications which don't depend very significantly on performance, for example vim or Emacs.
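As a minimal sketch of what the wider vectors mean in practice (this is not from the talk; the Intel targeting flags in the comments are the standard ones and are assumed here rather than quoted from the slides):

    /* Minimal sketch (not from the talk) of a loop the compiler can
     * auto-vectorize.  Assumed Intel compiler flags:
     *   icc -O2 -xCORE-AVX2  saxpy.c   -> 256-bit AVX2 binary (Haswell); still runs on KNL
     *   icc -O2 -xMIC-AVX512 saxpy.c   -> 512-bit AVX-512 binary for KNL
     */
    #include <stdio.h>
    #define N 1024

    int main(void) {
        static double x[N], y[N];
        const double a = 2.0;

        for (int i = 0; i < N; i++) {   /* initialize the arrays */
            x[i] = (double)i;
            y[i] = 1.0;
        }

        /* daxpy-style loop: with AVX-512 the compiler can process
         * eight doubles per vector instruction instead of four. */
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];

        printf("y[%d] = %.1f\n", N - 1, y[N - 1]);
        return 0;
    }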
A
So I mentioned memory modes on the last slide. KNL also has a new feature, which is configurable memory modes; this was not available on Haswell or on Ivy Bridge. What I mean by memory modes is that you can actually decide how you want the high-bandwidth memory — this MCDRAM — and the DDR to interact with each other. The first mode, which is conceptually the simplest in terms of writing code to use it, is called cache mode, and as the name suggests, this treats the MCDRAM as a transparent cache.
A
So it's not a separately addressable piece of memory that you can allocate into; it's totally transparent to you. And while Knights Landing does not have an L3 cache like Haswell does, you can sort of emulate an L3 cache using MCDRAM in this way. It's not as fast — the latency of MCDRAM is not as good as a true L3 cache — but it's definitely better than nothing.
A
There are some potential downsides to using cache mode, which I believe are on the next slide, but that is one mode that's available to you. From a practical perspective, when you're running jobs, in your Slurm script — in the little stanzas at the top where you're choosing the partition and the time limit and so on — there's actually a mode that you can specify. In fact, you must specify a constraint when you're requesting Knights Landing nodes: you actually have to tell it what memory mode you want.
A
In fact, if you don't, it will just give you something, and you'll find out at runtime what you got. So another mode that you can use is called flat mode, and in this case the MCDRAM is actually configured as a separate NUMA domain. So now, when you ask the hardware what memory it has available, it will tell you: I have two banks of memory, 96 gigabytes of DDR and 16 gigabytes of high-bandwidth memory. Flat mode, in terms of performance, can do better than cache mode.
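To make that concrete, here is a minimal sketch — not from the talk — of asking the hardware what NUMA domains it exposes, using libnuma (numa_available, numa_max_node and numa_node_size64 are standard libnuma calls; the build line is an assumption). In flat mode on a KNL node you would expect a large DDR domain plus a 16 GB MCDRAM domain.

    /* Minimal sketch (not from the talk): list the NUMA domains a node
     * exposes and how much memory each holds, using libnuma.
     * Build with something like:  cc query_numa.c -lnuma
     */
    #include <stdio.h>
    #include <numa.h>

    int main(void) {
        if (numa_available() < 0) {
            printf("NUMA is not available on this system\n");
            return 1;
        }
        int max_node = numa_max_node();
        for (int node = 0; node <= max_node; node++) {
            long long free_bytes = 0;
            long long size = numa_node_size64(node, &free_bytes);
            /* In flat mode, one of these domains is the 16 GB MCDRAM. */
            printf("NUMA node %d: %lld MB total, %lld MB free\n",
                   node, size >> 20, free_bytes >> 20);
        }
        return 0;
    }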
A
The downside is that it's a little bit more complicated to use. Either you need an application which has a fairly small memory footprint — because 16 gigabytes is not very much — or you have to explicitly address that memory, and there are tools for doing that, but you'll have to write into your application specific malloc and free statements that say "use MCDRAM" or "use DDR".
A
So it's a little more complicated, but you can get faster performance. And then a third option is hybrid mode, where you can actually choose a little bit of both: you can decide to have a little bit of the memory in flat mode and then a little bit in cache mode. This is not a particularly popular mode — in fact, I personally have never used it — and it's quite a bit to manage, so generally users will prefer either pure cache mode or pure flat mode. But you can choose either.
A
You can choose any of these modes that you want when you run your job. The downside is that they have to be configured at boot time, so if a node in the mode that you request is not available, Slurm will actually have to reboot the node, and — I forget what the actual time it takes these days is — you can expect 20 minutes to half an hour. So it's not short.
A
Let's see if I get this right — yeah, so on the left is KNL. This is in flat mode, so you'll actually see a separate NUMA domain with 16 gigabytes of MCDRAM and another NUMA domain that has the DDR. In contrast, on Ivy Bridge or on Haswell you will always see two NUMA domains; you can't choose to have just one, because there are two physical sockets. So, as for the general use case for running in flat mode:
A
If you need to allocate memory explicitly into the high-bandwidth memory, generally you want to choose the memory that will be used most often in your application. Because there's not much storage available — only 16 gigabytes — you want to devote that limited storage to the memory which needs to have very high bandwidth, so you can get good performance. So there are two ways to explicitly allocate memory into the high-bandwidth memory.
A
One is through this memkind library, which we already have on Cori, and there's a little description of the API on the next slide, I believe. If you're writing C or C++ applications, this memkind library is generally what you will use. The other option, if you're writing Fortran codes and you need to allocate memory into MCDRAM, is to use the FASTMEM directive, which is an Intel compiler directive. It's not part of the Fortran standard — it is an Intel-specific directive — but that option is available to you as well.
A
So you have to include — I believe there's a header file that you need to include, which is probably memkind.h — and then, in terms of what code you need to change: if you're using mallocs and frees, like in C code, all you need to do is change your malloc to hbw_malloc, and hbw_malloc will explicitly allocate memory into high-bandwidth memory. I don't believe there are corresponding new and free statements.
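As a minimal sketch of that substitution — not from the slides, and assuming the hbwmalloc interface shipped with the memkind library (hbwmalloc.h, hbw_check_available, hbw_malloc, hbw_free) and a build line roughly like "cc app.c -lmemkind" after loading a memkind module:

    /* Minimal sketch (not from the slides): replace malloc/free with the
     * hbwmalloc interface from the memkind library so the array lands in
     * MCDRAM when high-bandwidth memory is available.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <hbwmalloc.h>

    int main(void) {
        size_t n = 1 << 20;
        int have_hbw = (hbw_check_available() == 0);

        /* Fall back to an ordinary DDR allocation when no HBM is found. */
        double *a = have_hbw ? hbw_malloc(n * sizeof(double))
                             : malloc(n * sizeof(double));
        if (a == NULL) return 1;

        for (size_t i = 0; i < n; i++)
            a[i] = (double)i;
        printf("a[n-1] = %.1f\n", a[n - 1]);

        if (have_hbw) hbw_free(a);
        else          free(a);
        return 0;
    }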
A
Well, thank you for the clarification. So that is memkind on the left, and then on the right, if you're writing Fortran applications and you need to explicitly allocate high-bandwidth memory, there's this FASTMEM directive, which is shown in red. This directive is Intel-specific, so again, if you compile with the Cray compiler or with GNU, they won't know what to do with this flag, and I think they'll actually ignore it entirely.
A
But you just append this FASTMEM stanza to the array, or the memory, that you are going to allocate. I believe FASTMEM only works with allocatable arrays — I believe that's true — so if you just have, for example, real a(8), a stack array, I don't believe FASTMEM will work; it requires the allocatable attribute.
A
Yes, the comment from the audience is just pointing out that there are memkind modules available at NERSC, so "module avail memkind" will show you the options.
C
Okay, thanks Brian for the introduction to the architecture. I'm just going to describe in a little bit more detail some of the other architectural options in KNL.
C
So the most common mode, and also the default in almost every application, is quadrant mode. In quadrant mode the Xeon Phi chip is exposed as a single NUMA domain, but internally it's divided into four virtual quadrants, and what these virtual quadrants mean is that memory addresses are hashed to a directory — the distributed directory in the cache hierarchy — in the same quadrant as the core that the request comes from.
C
So what this means is that you get some benefit of locality when you try to access some memory that's local to your quadrant, and in general this is probably the easiest mode to use in terms of the performance versus ease-of-use trade-off. You only have to worry about one NUMA domain, you don't have to explicitly manage your locality, and you get a bandwidth benefit versus having an all-to-all mode where there's no affinity between the address and the directory.
C
And I guess I could describe it with this block diagram: it shows what happens when a core on the top-left tile, labeled 1, goes to access some memory address that it doesn't own. The request travels across the mesh to a directory that tells it which MCDRAM controller to request the data from, and then that gets sent back across the on-chip mesh to the tile.
C
So another way to configure the chip is through the sub-NUMA clustering modes. There are actually two different SNC modes: you can either divide the chip in half or divide it into quadrants, and in this case that division is exposed to the operating system and the user. So if you choose SNC2 mode, the chip appears as two separate NUMA domains, which will be familiar to anyone who has used the dual-socket Xeon nodes on Edison or the Haswell partition on Cori.
C
So in general the standard advice is to stick with quadrant. It's certainly worth checking the other modes, but quadrant is by far the easiest, and in pretty much every case I've seen it gives very close to optimal performance compared to the others. So yeah, as Brian mentioned earlier, there are a number of ways you can configure the memory — the on-package memory — on the Xeon Phi.
C
However, in this case, if you are using lots of memory and you have a cache miss, then you're paying a latency penalty, because you've gone to the MCDRAM and now you need to go to the DDR. In flat mode the MCDRAM is exposed as a separate NUMA domain that just contains the memory, and you can place your allocations there through numactl; numactl --hardware or lscpu can give you all the details of the exact hardware configuration that you have.
C
So the main choice is usually between cache and flat, and both have their upsides and downsides. In cache mode, the benefit is that it kind of works out of the box: you don't have to change your code or make any changes to your run scripts, and you do get a significant bandwidth benefit over running out of just the DDR.
C
So,
if
you're
really
trying
to
use
as
much
memory
as
possible
on
a
node,
this
mode
will
take
some
of
that
away
from
you
in
flat
mode.
Is
the
the
most
performant
choice
if
you're
running,
if
your
application
is
using
less
than
16
gigabytes?
So
if
you
know
for
sure
that
you're
only
using
you
know
10
gigabytes
per
node,
then
you
can
run
entirely
out
of
MC
DRAM
and
get
the
maximum
performance
possible.
C
You can use numactl to bind the memory allocations to a specific NUMA domain. The other option is using libraries like the memkind one that Brian talked about earlier: memkind can enable some compiler directives for Fortran, or you have to rewrite your mallocs and frees with the corresponding hbw versions in C code.
C
numactl --hardware shows you what you have, and then numactl --membind= tells it which NUMA nodes to bind your memory to. Then there are also these --preferred and --interleave options; --preferred gives you some control over spilling, so if your memory allocations might go over 16 gigabytes, instead of just running out of memory they can spill over to the DDR.
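A code-level analogue of that "preferred" spill-over behaviour — not mentioned in the talk, and assuming memkind's MEMKIND_HBW_PREFERRED kind, which asks for MCDRAM but falls back to DDR rather than failing — would look roughly like this:

    /* Minimal sketch (not from the talk): ask memkind for MCDRAM but let the
     * allocation fall back to DDR if high-bandwidth memory runs out,
     * analogous in spirit to numactl --preferred.
     */
    #include <stdio.h>
    #include <memkind.h>

    int main(void) {
        size_t n = 1 << 20;

        /* MEMKIND_HBW_PREFERRED: prefer MCDRAM, spill to DDR instead of failing. */
        double *a = memkind_malloc(MEMKIND_HBW_PREFERRED, n * sizeof(double));
        if (a == NULL) return 1;

        for (size_t i = 0; i < n; i++)
            a[i] = (double)i;
        printf("a[n-1] = %.1f\n", a[n - 1]);

        memkind_free(MEMKIND_HBW_PREFERRED, a);
        return 0;
    }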
C
So your launch command can be quite complex, but for single MPI ranks you can use numactl and then specify a comma-separated list of NUMA domains that memory allocations should be in, and again numactl with a capital -H tells you about the actual hardware layout that you're looking at.
C
Yeah, so the question was: is there huge-page support for MCDRAM? And yes, it works the same way. I think at one point there was a bug with a specific size of huge page in MCDRAM, but I think that's solved at this point. But yeah, you can compile with the appropriate craype-hugepages module and then use your application as normal.