From YouTube: 2 Using Cori KNL Nodes (NERSC Cori KNL Training 6/2017)
Description
From the Cori KNL Training held June 9, 2017. For slides please see http://www.nersc.gov/users/training/events/cori-knl-training-2/
The next topic is about compiling, so we'll talk about the available nodes. Cori has two types of compute nodes: Haswell compute nodes and KNL compute nodes. I assume most users are already familiar with compiling and running on the Haswell nodes, so we just want to compare the two — Haswell versus what Cori KNL offers.
KNL has much slower CPUs but more cores per node, and also a smaller memory per CPU. It has longer vector lengths, so there is more to exploit. It also has a more complicated memory hierarchy, especially when you consider MCDRAM and the different cluster modes, which are part of the node's boot options.
So one of the points I talked about is that binaries built for Haswell can run on KNL as well. If you bring a Haswell binary, you can run it on KNL already, but it won't be optimized, because you didn't exploit the longer vector lengths or the additional threading options. But it does not work the other way around, because of the new instructions introduced in KNL — in particular AVX-512.
On Cori, the situation depends on our configuration. You may hear that some other centers have the ability to compile on the KNL nodes; even where that works, it will be much, much slower, so we don't recommend it. For our particular configuration, we have a different setup of the OS image on the compute nodes, so compiling on KNL is not supported. So we have to cross-compile.
So you have to compile on the login node. This slide didn't convert well from PowerPoint to Google Slides, but if you saw it in the previous talk: basically, the MCDRAM memory-mode setup is relevant to both compile time and run time on Cori, so I wanted to briefly mention it. In the cache mode of the MCDRAM setups, the high bandwidth memory is used as
a cache, and it's always transparent to users. Flat mode is for when you want to manage it explicitly: you can put your code's memory into the MCDRAM, and there are different ways of accessing it — I'll talk about the details later. There's also hybrid mode, where you can split it half-and-half, or 75%/25%, different ways.
The compiler wrappers will know to link to those libraries automatically. The wrappers will also find the MPI libraries and the Cray scientific libraries and link them for you automatically, so users don't have to explicitly add those paths. The default compiler environment — the default user environment — is the Intel compiler environment.
If you want to use a different compiler, what you need to do is module swap PrgEnv-intel with another programming environment, and then use ftn to compile Fortran code.
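As a concrete sketch of the module commands described above (module and wrapper names as used on Cori at the time of this talk):

```shell
# The default environment is Intel; swap to, e.g., the GNU environment:
module swap PrgEnv-intel PrgEnv-gnu

# The compiler wrappers stay the same regardless of the underlying compiler:
ftn mycode.f90 -o myapp   # Fortran
cc  mycode.c   -o myapp   # C
CC  mycode.cpp -o myapp   # C++
```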
On the Haswell login nodes, the craype-haswell module is loaded as well, and the wrapper will link in the corresponding target — for the Intel compiler it effectively adds -xCORE-AVX2. Basically, the target for the machine type is built into the wrapper, so the binary it builds is targeted for that architecture. For Cori KNL, the module — if you want to target KNL — is craype-mic-knl, and that is what the next slide is about.
So now we try to build a KNL-target binary on the Haswell login node. All we need to do — and this is the best recommendation we tell our users — is module swap craype-haswell with craype-mic-knl, and then just go on with the wrapper. It's that simple: the wrapper will take care of the target for you, and it will also link the corresponding KNL libraries.
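The cross-compilation recipe above can be written in two lines (this is a sketch of the Cori workflow; the application name is hypothetical):

```shell
# On a Haswell login node, retarget the wrappers to KNL:
module swap craype-haswell craype-mic-knl

# Then compile as usual; the wrapper adds the KNL target flags and
# links the KNL builds of the Cray-provided libraries:
ftn -o myapp.knl mycode.f90
```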
Another option: sometimes a user wants to build a binary that can run on both targets, on Haswell and on KNL. The Intel compiler has a flag that specifically means "I want to build a binary targeted for multiple architectures." The flag is -ax: you put -axMIC-AVX512, meaning KNL, together with -xCORE-AVX2 targeting Haswell, so that your binary actually runs on both architectures. There is a very small overhead at runtime while it figures out
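A minimal sketch of that Intel-only "fat binary" build (flag spellings as described in the talk):

```shell
# Intel compiler only: one binary with code paths for both architectures.
# -xCORE-AVX2 sets the baseline path (Haswell);
# -axMIC-AVX512 adds an alternate optimized path (KNL):
ftn -xCORE-AVX2 -axMIC-AVX512 -o myapp.fat mycode.f90
```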
what the current target architecture is, takes the right branch, and finally runs. However, we just note that it's not as good, because, for example, of the libraries under the hood: when you compile under the Haswell environment, it will only pick the Haswell branch of those libraries from the linker, not the KNL one.
So the recommendation is still to swap the modules as the previous slide mentioned. Also, the disadvantage of the second option is that it only works for the Intel compiler. Now about linking: we already mentioned that the wrapper links in the libraries, but this topic is about whether you link to a Haswell library or to a KNL library.
Sometimes when we build, we put a pseudo KNL folder there — it looks like a KNL build, but it's actually a symlink back to the Haswell one. So most of those will actually still be linking to the Haswell library, and for the performance-critical libraries you want to link to real KNL builds — either Cray provides those or NERSC provides them. So be patient: not everything is ready, but if you do need something, you can send a ticket to us and we can work on it at a higher priority.
Some of them — for example Cray LibSci, FFTW, PETSc, and the third-party scientific libraries — already have special builds for KNL, and MKL also has KNL-targeted optimizations. But there's a special note: when you want to use that, you need to link with libmemkind. There are two kinds of ways to link it — we'll have another slide on that later. The point is that users do not need to build these libraries for KNL themselves.
This is the convention for NERSC software: we use /usr/common/software, then the software package name, the version number, then either hsw or knl, and then the different compilers. There's an example with PETSc 3.7.4: it has both hsw and knl builds with the Intel compiler, and there are also builds for the other compilers.
There's one issue related to cross-compilation: some of these automated build systems need to run a small test program during the build process — for example, configure needs it to generate a Makefile. That will break, because that binary, already built for KNL, won't run on Haswell. So there are a few workarounds. For autoconf-style builds, what you want to do is run your configuration step still under the Haswell environment, and run the small test programs there.
After that, you swap to craype-mic-knl, and then make will actually compile with the right flags so that you get the correct binary. For CMake, Cray has worked with the CMake vendor, and since version 3.5.0 there are just two simple commands: you export CRAYOS_VERSION=6 and then you define the CMake system name as CrayLinuxEnvironment; then there's no cross-compilation issue. Right, so we talked about basic compiling and basic linking; now I want to touch upon the MCDRAM.
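The two workarounds above can be sketched as follows (paths and project names are placeholders; the CMake variable names are the standard ones for the Cray platform support added in CMake 3.5):

```shell
# Autoconf-style build: configure (and its small test programs) run
# under the Haswell environment, then swap targets before make:
./configure --prefix=$HOME/install     # craype-haswell still loaded
module swap craype-haswell craype-mic-knl
make && make install

# CMake >= 3.5.0 knows how to cross-compile for the Cray environment:
export CRAYOS_VERSION=6
cmake -DCMAKE_SYSTEM_NAME=CrayLinuxEnvironment ..
make
```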
So, the MCDRAM is talked about both here in the programming and compiling part, and again in the running-jobs part. It belongs in the compiling part because it depends on how you want to use it in the different modes and how you want to link — for some of the ways of using it, you need to link to a specific memkind library. That's why it's in the compiling part. So, just to briefly review the MCDRAM: there are two ways to use it, cache mode and flat mode.
We talked about the cache mode: basically there is no change to make in your code, your build procedure, or your run procedure. It's all taken care of by the OS transparently, and it's free. That's why the cache mode is actually our default and works for most users — but we also allow users to explore the other modes. So if you want to explore flat mode, there are a few ways to use the MCDRAM.
One way is to say: I want to put all my data into MCDRAM. The MCDRAM on the node is only 16 GB. If you know your usage fits within 16 GB, that's great — then you just run with your usual srun options and prefix the executable with numactl -m 1, because NUMA node 1 is the MCDRAM in flat mode: this forces all memory allocation onto NUMA node 1. There's another option through srun itself, memory binding: you say --mem-bind=map_mem:1, and that's it.
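A sketch of both variants (task counts are illustrative; the srun option spelling should be checked against `man srun` on your system):

```shell
# Quad flat mode, data fits in the 16 GB MCDRAM (NUMA node 1):
srun -n 64 -c 4 --cpu-bind=cores numactl -m 1 ./myapp

# Equivalent binding through srun's own memory-binding option:
srun -n 64 -c 4 --cpu-bind=cores --mem-bind=map_mem:1 ./myapp
```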
But if your memory usage is over 16 GB and you use this way, it doesn't fit and it will fail. So another way, if you know it's too big or you're not sure, is to use -p, which means preferred, instead of the -m strict option — preferred spills over to DDR after the MCDRAM fills. And the other option is to say explicitly: I know my memory usage is over 16 GB, and I also do not want the system to just put my first 16 GB in there — I want to choose.
I want to put my heavily used, big data into it. So what you want to do is use hbw_malloc to replace malloc — that's the C way of using it — and in Fortran the different compilers have different directives. The first line is the Cray way, the !dir$ memory(bandwidth) directive followed by your arrays, and for Intel you say !DIR$ ATTRIBUTES FASTMEM for your arrays.
The way this works, it only applies to dynamically allocated arrays, so your stack variables and Fortran pointers cannot go there with this method. So that's how you change your program; now we should talk about how you compile when you use libmemkind. There are two ways. Cray has provided a cray-memkind module: you load that module, and then again use the ftn/cc compiler wrappers; the wrapper will add the needed flags to the build command.
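A minimal sketch of the Cray-module route described above (the module name is as given in the talk; the source file is a placeholder that calls hbw_malloc):

```shell
# Load the Cray-provided memkind support, then build with the wrapper;
# the wrapper adds the memkind link flags (and -dynamic) for you:
module load cray-memkind
cc -o myapp mycode.c
```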
It will actually add -dynamic, so the generated binary is dynamic. By default — without -dynamic — static linking is the default way the compiler wrappers work. Most applications are built static, so you may have lots of static libraries lying around, and building as dynamic may not be the preferred way. In that case, NERSC has a module called memkind, and it links with memkind and jemalloc without -dynamic, so that you can still build a static binary.
So I talked about putting your self-chosen variables into HBM — then how do we choose them? There's a tool to lead you there: VTune has a memory-access collection option that will help you diagnose which arrays are good candidates for putting into the MCDRAM. I think there will be more about VTune later, and it was also touched upon this morning.
There's also another one, called the AutoHBW library. For this one you don't have to identify which arrays to put into the MCDRAM. Instead you give it the criteria: if a variable is bigger than some size — say bigger than 4K, or between 4K and 8K in these examples — then I just load the module and set that with an environment variable, with no programming change at all, and it will take care of putting those allocations into MCDRAM.
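A hypothetical usage sketch — the module name and the exact environment-variable spelling may differ on your site, so check the AutoHBW documentation shipped with memkind:

```shell
module load autohbw
export AUTO_HBW_SIZE=4K        # redirect all allocations >= 4K to MCDRAM
# or give a range, e.g. only allocations between 4K and 8K:
export AUTO_HBW_SIZE=4K:8K
srun -n 64 ./myapp             # no source-code change required
```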
So basically I've talked about everything I wanted to cover about how to compile. Just as a summary: you build on the login nodes, and you use the provided libraries — and if you need more libraries, let us know. The takeaway is module swap craype-haswell craype-mic-knl: that's the easiest way to do the build. Or, if you absolutely want one binary that runs on both architectures, then for the Intel compiler you can use the -ax flag as an
alternative workaround. Then keep in mind the points about using MCDRAM — we'll talk about that a little more later. So, any questions so far? Any questions in the chat room? Alright, then I'll go on to running jobs. For running jobs we always want to emphasize that there are so many cores — there are 68 cores on a KNL node, and four hyperthreads each. Where you put your processes and where you put your threads is called affinity. Process affinity is binding your MPI tasks to the CPUs; thread affinity
is binding threads to the CPUs that are allocated to the MPI task those threads belong to — and there is memory affinity too. These affinities are essential. If, say, you put all your MPI tasks on one core, you obviously won't get good performance, so we want those to be evenly, nicely spread out; and then don't over-allocate, and don't bind two different threads to the same place. That's what we try to achieve here. Remember memory affinity as well: where you put your data.
So that's the one thing we want to achieve: good affinity for your processes, threads, and memory. Another thing we want to achieve is portability. We know the Intel compilers have some specific settings, but we try to see if we can find a more portable way that works for all compilers, which is the OpenMP 4 standard settings we try to use.
Okay, just keep this in mind. We also mentioned the many different cluster modes and the different memory modes, and they're all going to affect the affinity. For the cluster modes, there's the quadrant mode we use a lot — we say quad cache, quad flat; that "quad" means the quadrant mode you heard about in the first talk today — and for those there are no extra NUMA domains:
there is one domain with all the CPUs in it. Then we have the sub-NUMA clustering modes, SNC2 and SNC4, and there you start to see more NUMA domains in the memory, different caches, and such; and when you say flat, you introduce one more NUMA dimension. I want to show you one of the utilities, called hwloc — it basically provides hardware locality information. Once you're on the compute node you can run it; this is an example: you get the cores and caches, so here you see the layout.
We try not to allocate across, say, the boundary of a tile, as much as we can avoid it. Another utility is numactl -H — capital H for hardware — and then you can look at how many CPUs there are, which CPUs belong to which NUMA node, and their distances. This example is the 68-core quad cache node: we see 68 cores, and times four that is 272 logical cores — or, in Slurm's terminology,
each logical core is a CPU. So you have 68 cores and 272 CPUs, and they are listed as below: node 0 has CPUs from 0 to 271. Note that CPU 0, CPU 68, CPU 136, and CPU 204 are actually one physical core — four hyperthreads on that core. In the quad cache mode the node shows all the memory, the 96 gigabytes of DDR memory; the other 16 gigabytes of MCDRAM is not shown because it is cache.
It's not explicit memory, so it's not shown, and there's only one NUMA domain, called NUMA node 0; the NUMA distance is 10 from itself to itself. Now we look at a flat node. On the flat node you will see something very similar: NUMA node 0 has all the CPUs and the 96 gigabytes — and then there is another NUMA node:
NUMA node 1, which has no CPUs but the 16 gigabytes of MCDRAM. Basically, you can consider that you want to allocate memory within each NUMA domain, and for the HBM, the ways to allocate memory that we talked about — numactl with the -m or -p options — you just put the correct NUMA node ID (1) there, so that allocations go to the high bandwidth memory.
So this example uses 128 CPUs in total, which is within the 272 CPUs available on the node — 16 times 8 seems a good number. On Haswell you might have been using a setup like this all along and it works fine. So on Cori KNL, let's give it a try: I say export OMP_NUM_THREADS=8, and there are two OpenMP options: I want the proc bind to be spread, and I want the places to be threads.
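Put together, the runtime settings just described look like this (16 ranks times 8 threads = 128 CPUs in use, within the 272 available):

```shell
export OMP_NUM_THREADS=8
export OMP_PROC_BIND=spread    # spread threads over the allocated CPUs
export OMP_PLACES=threads      # pin each thread to one hardware thread
srun -n 16 -c 16 --cpu-bind=cores ./myapp
```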
What we want is very clear. Say I have MPI: this one color is MPI rank 0 with its 8 threads, and this is where I want them to be. Let me explain this a little more. In this plot, the first line here — all the red numbers — is the physical cores, 0 to 67, and then I give these numbers here
as the hardware CPUs: the logical CPUs 0, 68, 136, and 204 are actually one core's hardware threads, and then the second core has CPUs 1, 69, 137, and 205. So what I want is this for my first MPI rank, because I have 16 MPI ranks in total.
Over a total of 68 cores, that's approximately four cores per MPI rank. So I want MPI rank 0 to be on these first four cores, and then I have 8 threads that I want to spread out.
The first two threads would be on 0 and 136, the second two threads on 1 and 137, and so on. Then this other color is my second MPI rank: my MPI rank 1 I want on cores 4 to 7, and its 8 threads — because I use spread — I want spread out over these different hardware threads. So this is what I want.
So let me show you here: I'm giving it 16, meaning I'm giving it full cores — all the hardware threads of these four cores, 16 CPUs, to my MPI rank 0. That is what the -c option does. Then again, my next MPI rank also gets 16 CPUs, like a pre-allocation. Once you declare it, the runtime looks at this number of CPUs, and then however many threads you use, as long as the number of threads is smaller than that, it won't be over-allocating.
And you can see: if I have only two threads, it will still try to spread them — I would have the two threads on 0 and 136, or 1 and 137. But if I do have 16 threads, that's okay too: each thread pins to one of these CPUs. Although you can also relax this — you don't have to bind each thread to each CPU;
you can allow them to migrate a little bit if you want. That's what OMP_PLACES controls: if you set OMP_PLACES=threads, each thread is pinned to one CPU only; if you say OMP_PLACES=cores, then you're allowing your threads to float within those cores. Those are the choices — you can also bind to sockets, among other options.
So these are basically the essential points. We want our users to use srun's -n and -c options and the --cpu-bind option. For the MPI ranks: if you're not using more than 68 ranks per node, you bind them to cores; once you're using MPI ranks over 68, then you want the --cpu-bind set to threads. The other essential runtime options we want our users to use are OMP_PROC_BIND and OMP_PLACES.
Initially we told users that the Intel runtime basically defaults to something like spread, and the Cray compilers as well, but the GNU compiler has a different default, so you won't get a consistent layout across compilers. So we recommend users set OMP_PROC_BIND explicitly. We also filed a bug with the GNU compiler; they have fixed it, but it hasn't been released yet. Then there is OMP_PLACES, where you can use the different options.
So these are the two slides: the first slide is about the srun settings for process and thread affinity, and this slide is about the runtime settings for memory affinity. I think I have already talked about these points: if your usage is over 16 GB, note that -m — without "preferred" — is strictly enforced, and if you happen to go over, your application will fail; -p is the preferred option, down here.
Right, in this slide we want to talk about how you request the different modes. We mentioned a few, and there are actually combinations you can choose from. We set aside some nodes: most of our KNL nodes — over six thousand of them — are fixed at the quad cache mode, and we have about three thousand nodes that are allowed to reboot. So when you request, you can say capital -C, knl, and then the different NUMA mode:
quad or the SNC modes, and then the MCDRAM mode: cache, flat, or split. Although, as I mentioned before, SNC4 is not generally recommended, simply because the four NUMA domains you get are not even: there are a total of 68 cores, so you get 9, 9, 8, 8 tiles per NUMA domain, and it's going to be a mess, because your MPI tasks per domain end up uneven and you cannot easily distribute them.
There's another option, core specialization — I'm going to mention it later. We tried to use the extra cores for the core-specialization reservation, but Slurm doesn't support that nicely. So basically we do not recommend SNC4. For the others you can say, okay: SNC2, split, and so on — but most likely, if you do that, your job will end up in the KNL reboot partition, which means, if there are currently no nodes available in that mode,
you will have to wait for a reboot, and it takes about 20 to 40 minutes, depending on how it goes. Sometimes one of the nodes won't come back healthy. So your job will be assigned to the reboot partition and it will pend and wait, and once the nodes are allocated to your job, Slurm may find they are
still not in the mode you desire — then you get into this configuring stage. If you run the queue-monitoring command, squeue, you'll see the state shows as CF, meaning configuring, and in the node list you can see the nodes you have already got, but the job won't start yet. The good thing is that this does not consume your walltime request: say the reboot takes 40 minutes and your walltime request is one hour — you still have the full hour to run after the configuration is completed.
The first example is an MPI code in quad cache mode, and there are a few flags I just want to mention: this example uses one node, submitted to the regular partition for one hour; you request the scratch file system license; and you reserve two cores for core specialization — I'll mention what that is for in a later slide. And then the most important thing: you want a quad cache node.
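A sketch of a batch script combining the flags just listed (directive spellings as used on Cori; the application name is a placeholder):

```shell
#!/bin/bash
#SBATCH -N 1                 # one node
#SBATCH -p regular           # regular partition
#SBATCH -t 01:00:00          # one hour
#SBATCH -L SCRATCH           # scratch file system license
#SBATCH -S 2                 # reserve 2 cores for core specialization
#SBATCH -C knl,quad,cache    # the important part: a quad cache KNL node

export OMP_NUM_THREADS=1     # keep multithreaded libraries quiet (see below)
srun -n 64 -c 4 --cpu-bind=cores ./myapp
```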
So, even though this is a pure MPI code, we still recommend you set OMP_NUM_THREADS=1. That's because when the wrapper compiles, it sometimes links in multithreaded libraries, and you don't want those to spawn threads by default — most multithreaded libraries, like Intel's, would otherwise use however many CPUs are available, up to 272.
Alright, so for this example we have 64 MPI tasks and we allocate 4 CPUs for each, which actually means one core per rank. This is rank 0 on one core, and another core, also with 4 CPUs, is rank 1. Because you bind to cores, the MPI task will bind to this core, but it can actually still move freely within the core's 4 CPUs if the OS thinks it's needed. Each box here is a core — or actually it is a tile, but that's okay: for MPI ranks, you
can have two different MPI ranks on a tile. The thing is, we do not want a mix — say thread 0 from rank 0 and thread 0 from rank 1 crossing the core boundary; that's not good. So the two options here are simply -n 64 -c 4, and in this case the last four cores are not used.
Here it's almost the same as the previous slide, but now I have 16 MPI tasks. For 16 tasks, if I want them to be more evenly distributed, I give more CPUs to each: basically four physical cores, which is 16 logical CPUs. So cores 0, 1, 2, 3 are rank 0, and then all the way down, rank 15 is on cores 60 to 63, and again the last four cores are not used.
In this one, now I want to use some OpenMP threads — four threads. Compared to before, this one just adds the number of threads: 64 MPI tasks again, and I give it -c 4, meaning rank 0 will be here. And since there are no OpenMP binding settings at all, the threads can freely float within these four CPUs.
For the next one, I additionally set OMP_PROC_BIND=true and OMP_PLACES=threads. Now I want the threads to be bound, with each thread pinned to a logical CPU. So again with 64 MPI tasks and four threads each: thread 0 of MPI rank 0 would be on CPU 0, and thread 1 of rank 0 is bound to CPU 68.
This is pretty similar to the previous one, except with more threads: I have eight threads and fewer MPI tasks — 16 MPI tasks with 8 threads. Again I'm using the first 64 cores only, and I'm giving each rank 16 logical CPUs, but I only have 8 threads, so not all the CPUs in those cores will be used, because I'm still binding each thread to a CPU. You will see that threads 0 to 7 are bound.
Okay, now flat — I think this is the quad flat mode. If I'm not allocating anything into MCDRAM, the way of binding everything is exactly the same, and the command will be the same as well if you don't use the MCDRAM — the same -n 64 -c 4 or -c 16 and so on. So basically what we recommend is to use these layouts, especially when your number of MPI tasks is a power of two or similar.
I don't have a separate slide for that: if I want to use MCDRAM, all I need to do is add the option — numactl -m 1, or the membind-preferred variant — just adding it here. What I'm targeting is that, for the quad flat mode, NUMA node 1 is the HBM, the MCDRAM.
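That is, the flat-mode launch is the cache-mode launch plus a numactl prefix (strict versus preferred, as discussed earlier):

```shell
# Strict: all allocations in MCDRAM; fails if usage exceeds 16 GB:
srun -n 64 -c 4 --cpu-bind=cores numactl -m 1 ./myapp

# Preferred: fill MCDRAM first, spill the rest to DDR:
srun -n 64 -c 4 --cpu-bind=cores numactl -p 1 ./myapp
```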
So I just want to quickly go over what --cpu-bind and OMP_PLACES look like. This is an illustration of a core — actually, I'm sorry, a tile. Say the bottom one is core 0 and the top one is core 1: core 0 has CPUs 0, 68, 136, 204, and core 1 has the other four.
When you say --cpu-bind=cores, it means a process can migrate within the CPUs of a core. But if you say --cpu-bind=threads, then the process is bound to one particular hardware thread. And if you say OMP_PLACES=threads, similarly, each OpenMP thread will be bound to a single logical CPU.
But if I say OMP_PLACES=cores, then within my MPI rank you can have, say, four threads within this core, and each of them can float around — or I could have even just one thread on that core, but with OMP_PLACES=cores that thread can still use any CPU of the core during the execution.
So that's just the concept at the core of it: how do you bind — whether you bind to cores or to a thread — and what the impacts are. Right, so I just want to bring up a different topic. We showed all these batch scripts and everything — but what if I want to run something interactive, quick runs, to debug things easily? There are capabilities available. One is debug — I think you're familiar with debug.
The limits for running in debug are a maximum of 512 nodes and 30 minutes. On Cori Haswell we have reserved nodes for debug, so it's easier to get in; on Cori KNL we don't have reserved nodes yet, but so far getting a debug node on KNL is not too hard. We monitor it, and if it is needed we can implement a reservation. There's the limit of 30 minutes and one running job per user, and you can queue up to five jobs.
So that's debug — sometimes you have to wait, but currently the wait is not too long at all. There's another capability called interactive; the way to use it is just --qos=interactive, and it's very similar. However, it has much tighter limits: per repo you can use up to 20 nodes, and each user can only run one job at a time.
The goal for this is that once you submit, either you get the nodes you requested or you get rejected because the nodes are not available — it's an immediate kind of response. So these are good for quick debugging. Although, of course, the limit is only up to 20 nodes; maybe you need a bigger size for your debugging — then you can't use interactive, just use debug. Okay. So in the next set of slides I'm going to talk about some of the recommendations for running jobs — let me just pause a little bit.
Alright, so, recommendations. First we want to talk about using huge pages. We talked about compiling — nothing special there, just ftn or whatever — but using huge pages is very easy: we just module load craype-hugepages. The reason is that the default page size is only 4K, and lots and lots of tests have shown that using huge pages is beneficial, so we want you to try it out.
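A minimal sketch of the huge-pages workflow (the 2M module is the one the talk recommends starting with):

```shell
module load craype-hugepages2M   # 2M pages; other sizes also exist
ftn -o myapp mycode.f90          # recompile with the module loaded
# keep the module loaded in your batch environment at run time too:
srun -n 64 -c 4 --cpu-bind=cores ./myapp
```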
These are applications where we showed the benefit. There are lots of craype-hugepages modules available, from 2M all the way up to a bunch of sizes, and we found that 2M is already helpful. You can try the other sizes, but start with 2M — and there's a notice in the man page
you can check out for more information. Here's our plot versus different numbers of nodes, and the red color is the performance you get from using huge pages. It's not a big improvement, but it's a good, easy achievement and convenient to use, so we recommend using huge pages. Another recommendation: people are reporting performance variations. The reason is that if your job is allocated nodes that are scattered everywhere in the network, it could obviously affect your
MPI communication. So you want to constrain them to a smaller set of cabinets and nodes. There's the concept of a switch, which is basically about 384 nodes — two cabinets on a switch. So with Slurm you can request the number of switches you want, and the maximum number of hours you would be willing to wait. It's sort of extra waiting, but not necessarily: if your job fits right away, that's fine, but otherwise you're saying "I'm willing to wait a little bit longer so that my job can have a closer topology."
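In Slurm this is a single batch directive (the count and wait time here are illustrative):

```shell
# Ask for nodes spanning at most 2 switches, and be willing to wait
# up to 4 extra hours for such an allocation:
#SBATCH --switches=2@4:00:00
```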
So that's a recommendation. The other one is called zone sort. This is a recommendation, but it's also mostly already done by default. Zone sort is actually a solution — a technology applied to an issue. The problem is that, with the quad cache nodes, as time goes on the cache conflicts will increase. The reason is that the MCDRAM cache is direct-mapped, so addresses map into it modulo the cache size:
if two addresses map to the same place, they cannot both be in the cache at the same time, so one has to go out and come back in. All of this affects the performance. What the zone-sort technology does — it's now on by default — is that every time your job is about to start, it sorts the available pages for you, so that you get the good pages first. Then there are options: you can turn it off,
or you can try setting it to run at intervals — however many seconds you want. So if you really want to experiment — the "on" setting is already the default, but if you really want to try, you can use the third option. That's also a recommendation. Another recommendation introduces something called sbcast. Back in the old Torque/Moab days, it was the default:
the scheduler would copy your executable onto the image of each compute node, so that they all start at about the same time — without it, there could be a big delay. So our recommendation is: if you have a job bigger than, say, 1500 MPI tasks, do a manual sbcast. I think we have a request in with Slurm to make it the default — not yet. So what you do is you sbcast
your code to somewhere under /tmp; /tmp is actually in your compute node's memory. And then you run with your normal srun options, and with numactl if it's flat mode, but you do not launch your own original executable: you have to, you know, replace it with the one you already copied over to /tmp. The same goes if you are launching through numactl: the final argument should still be the copied executable, not your original one.
A
Capital S is basically: what I want is to reserve some number of cores per node for just doing OS work, so that the rest of the node won't be disturbed. So that's a good, good concept, and since the OS is mostly using the last few cores anyway, why don't I just dedicate them? I think probably -S 2 is good enough; some people would say four, or one, and it doesn't change things much. But note this can only be used in batch jobs.
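As a sketch, core specialization in a batch script could look like this (the node count, time, and task layout are made-up values):

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:10:00
# Reserve 2 cores per node for the OS (core specialization);
# this is only honored for batch jobs.
#SBATCH -S 2

# On a 68-core KNL node this leaves 66 cores for the application.
srun -n 66 -c 4 --cpu_bind=cores ./a.out
```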
A
Okay. So those were, you know, a set of different recommendations, not the general running-jobs recommendations, and not the high-memory-jobs recommendations; these are more specific to KNL. And then I also want to introduce that NERSC has a job script generator, which should be helpful. This is on the My NERSC page: if you go to My NERSC, and then from the left side search for Jobs (you probably have to expand it), you will see the job script generator.
A
Then I also want to touch upon: how do I verify the affinity? We have many ways of verifying affinity. We showed you all those core-binding illustrations, but there are also, you know, command-line and other ways, so I want to talk about each of them. So Cray has given us something called xthi.
A
We've been using this a lot whenever we want to check if we're getting good affinity. It's basically a hybrid MPI/OpenMP code, and it reports back things like: task 0, thread 0 is bound to CPU such-and-such. For our users, we have basically prebuilt those binaries. So what you need to do, whether my code is a pure MPI code or a hybrid code, if I use one of the compilers and I'm on Cori or on Edison: just pick one of the binaries and stick it into whatever srun line you
A
have set up for your application; you only have to replace your binary with one of those to check if I'm getting the thread affinity I want. Then, if it's correct, it's all good: replace it back with your own binary. So you can be sure you're not being punished by, you know, a wrong allocation and things like that. Besides using xthi, there are also two more options. For the Intel compiler, there's the KMP_AFFINITY runtime environment variable, so you don't have to do anything special.
A
You just set that to verbose, and you can use your own application if you want, and it'll, you know, report something to you. And for the Cray compilers, there's something similar called CRAY_OMP_CHECK_AFFINITY; you can set it to TRUE, and it also reports something to you. For both of those, they don't give you MPI ranks at all, because these are OpenMP runtime environments only; they don't know anything about MPI. So you have to, you know, figure out yourself which lines belong to rank 0, rank 1, and so on.
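A minimal sketch of the checks just described (the binary name xthi.intel is a made-up placeholder for one of the prebuilt binaries, and the task counts are illustration values):

```shell
# Substitute one of the prebuilt xthi binaries into your own srun
# line to see where each MPI task / OpenMP thread lands:
export OMP_NUM_THREADS=4
srun -n 8 -c 16 --cpu_bind=cores ./xthi.intel

# Or let the OpenMP runtime report the binding of your own binary
# (no MPI rank information in either case):
export KMP_AFFINITY=verbose            # Intel compiler
export CRAY_OMP_CHECK_AFFINITY=TRUE    # Cray compiler
srun -n 8 -c 16 --cpu_bind=cores ./my_app.exe
```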
A
So then, also for Slurm, there's --cpu_bind=verbose: you can put that in and then check the CPU masks it prints. And for memory binding, there's also --mem_bind=verbose; you can check your memory affinity with that. And there's also numastat -p with your process ID: when your job is running, you can run that and check the NUMA memory usage and NUMA information.
A
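The binding checks just described could look like this on the srun line (the task counts and the PID are made-up placeholders):

```shell
# Ask Slurm to report the CPU masks and memory binding it applies:
srun -n 8 -c 16 --cpu_bind=verbose,cores --mem_bind=verbose ./my_app.exe

# While the job is running, inspect per-NUMA-node memory usage of
# one of its processes (replace 12345 with the real PID):
numastat -p 12345
```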
Right. Then I'm going to talk about a few useful commands for just, you know, finding things out: how many nodes are available, where my jobs are, things like that. So one basic thing is sinfo: you can ask for a format, and what it gives you is a view of the nodes by their different features, and the allocated, idle, other, and total counts, something like that. So in this report, what you get is lines for the quad,cache KNL feature.
A
Let's look at the totals. You may see knl,cache,quad; cache,knl,quad; cache,quad,knl and so on, but they're actually all the same node type: KNL quad,cache, and the order of the features doesn't matter. Adding them up tells you how many there are in total, and you can also check how many are idle: if you look at that column, there are only about two hundred forty idle currently. And the CPU count it reports back is basically the number of cores times the number of hyperthreads.
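A sinfo query along these lines (the format string shown is one plausible choice, not the only one):

```shell
# Node counts as allocated/idle/other/total (%F), grouped by the
# nodes' feature sets (%f):
sinfo -o "%.15F %f"

# Filter by node state, e.g. to see which nodes are allocated:
sinfo -t allocated -o "%D %t %f"
```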
A
You can also ask things like when a node was booted, or its state: if a node is allocated right now, you can ask for state equals allocated, something like that. And also, if you want to know about a job ID, either your job or a job in the queue that you're curious about, you can run scontrol show job with that job ID, and it'll tell you the job ID and the job state. This job was queued when I checked it; the reason was Priority.
A
It's not running yet; that's the Priority reason. It also shows when it was submitted and which partition it was submitted to; this one actually asked for something that needs a reboot. It asked for 32 nodes, and it shows, you know, the command that was used to submit this job and where the job was submitted from. Lots of information, so you can check it.
A
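For example (the job ID here is a made-up placeholder):

```shell
# Show everything Slurm knows about one job: state, reason,
# submit time, partition, node count, submit command/directory, ...
scontrol show job 1234567
```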
Sometimes maybe you want to save a history of your jobs, something like that. A few others: sacct can be used to query many things; basically, sacct is a command to query the Slurm database. Sometimes you want to see, you know, how many jobs ran in the past under my username, something like that. And if you look at the sacct man page, you can give it a format and list all your jobs: you know, start time,
A
end time, elapsed time, number of nodes; all sorts of statistics you can gather with sacct. You can sacct some other users as well; it's not just your own jobs that you're able to query. And then sqs, or squeue in this case, shows the queued jobs and the requested columns: the number of nodes, the QoS that the jobs have requested, or the reason why a job is
A
you know, still in the queue, something like that. And I'd recommend you read some of these man pages. And finally, I want to mention: some jobs use lots of I/O, and the burst buffer is not part of KNL, but it's available on Cori, and that's something you can take advantage of: use the burst buffer for your I/O to speed it up, and please check out the web page to learn how to use the burst buffer. I think I'm through; two more slides, about the current queue structure and upcoming changes.
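The history and queue queries above could look like this (the username and date are placeholders):

```shell
# List my jobs since a given date, with start/end/elapsed/node count:
sacct -u myusername -S 2017-06-01 \
      --format=JobID,JobName,Start,End,Elapsed,NNodes,State

# Show what's in the queue: job ID, name, nodes, QoS, and the
# reason a job is (still) pending:
squeue -u myusername -o "%i %j %D %q %r"
```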
A
So, starting July 1st, we're going to charge, and we have already enabled all users, and we expect more users to come to Cori. So we are adjusting and making changes. Possible changes are: we may remove the partitions, so everything you request would be via QoS, with sbatch --qos=debug, --qos=regular, or --qos=premium, something like that. And then we will, you know, give you a lot: we will allow you to submit more jobs if you want, and we will also change the buckets a little bit; the boundary, the one at 160, could
A
be changed to a somewhat bigger number. But watch for announcements: nothing is in there yet, and nothing is solid or decided; those are just possible changes you may see. And if, for this, you use QoS instead of -p, you may want to modify your scripts for that; it will make for a smoother experience. So, we talked about charging; how much will be charged? It is the base charge factor of 96, times the number of nodes used, times the actual walltime. And here's an example.
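A back-of-the-envelope sketch of that formula (the node count and walltime here are made-up values):

```shell
# charge = machine charge factor (96 for KNL) * nodes * walltime hours
factor=96
nodes=10
hours=2
charge=$((factor * nodes * hours))
echo "$charge NERSC hours"   # 1920 NERSC hours
```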
A
So look for announcements as well. So, the summary for running jobs: use sbatch -C to request the different types of nodes you want; always use -c and --cpu_bind with the srun command; and also use OMP_PROC_BIND or OMP_PLACES to fine-tune your OpenMP threads. For memory access from MCDRAM, use --mem_bind or numactl. And also check out the other, high-level running-jobs pages, and I'd also like to ask you to take advantage of the job script generator.
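Pulled together, those summary recommendations could look like this minimal KNL job sketch (all sizes and names are placeholders):

```shell
#!/bin/bash
# Request quad,cache KNL nodes with -C:
#SBATCH -C knl,quad,cache
#SBATCH --nodes=2
#SBATCH --time=00:30:00

# Fine-tune OpenMP thread placement:
export OMP_NUM_THREADS=4
export OMP_PROC_BIND=spread
export OMP_PLACES=threads

# Always give srun -c and --cpu_bind:
srun -n 32 -c 16 --cpu_bind=cores ./my_app.exe
```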
A
On OMP_PROC_BIND: the change is committed somewhere upstream, so I asked them the question of when the officially released version would be out; I haven't heard back, but once we have that version we could tell users to use OMP_PROC_BIND=spread for all three compilers. Other than that, on compilers, what we recommend is: the Intel compiler, you know, for Intel processors, is always a good native compiler to use, but GCC is also available, and it's used for lots of application packages.