Description
From the Cori KNL Training held June 9, 2017. For slides please see http://www.nersc.gov/users/training/events/cori-knl-training-2/
So we have already heard quite a lot of in-depth material. With all these options you have heard about, what I'm going to give you is more of an overview: what is really beneficial for many applications. There are loads of configuration options, but most of them usually won't help you a lot, so I will basically tell you what helped for us — and "for us" means from the NERSC Exascale Science Applications Program (NESAP) point of view.
I will explain in the following what that means. So first, again, the differences between, for example, Edison and what we have in Cori KNL.
We now have twice as many nodes, which is quite good throughput-wise, but also note that we now have many more cores per CPU as well as many more hardware threads per CPU. So when you target Cori KNL, you want to exploit that.
If you don't, your code basically won't perform well, because the single-core speed is about half: a core is roughly twice as fast on Edison, for example, compared to KNL, and the gap is even bigger for Haswell.
On the other hand, you have much wider vector units, and you have two of them per core. Before, if you had a code which did not use the vector units much, you lost maybe a factor of four in theory — and since not everything vectorizes perfectly, it usually turned out to be maybe a factor of 2. But here, if you do not vectorize, you can lose a factor of 16 in performance. That's more than an order of magnitude, and you don't want that.
Another difference — this was talked about already — is that Edison's Ivy Bridge processor has 30 megabytes of shared L3 cache, which helped a lot in many applications, more than you may realize, and the KNL doesn't have that. So you can think about what that means.
We heard before that in quad/cache mode, for example, the MCDRAM acts as a cache for DDR accesses, but this cache is only about 450 gigabytes per second, whereas an L3 cache is still about a terabyte per second or so. So that's a real difference, and there are other caveats. On the other hand, you have these sixteen gigabytes of high-bandwidth memory, from which you can pull data about four times as fast as before.
So you want to make use of this, and you think: okay, what can we do — just recompile the code and go? KNL is x86-64 compatible, so you can just take your previously compiled code and technically use it as before — not even recompile, just run it. You can do that. It's also self-hosted, so there is no need for offloading and complicated programming models; you can technically just run your code.
So what happens if you do that? This chart illustrates it: these are selected codes from the NESAP program, and this is the performance when you just take the original codes, compile them on KNL, and compare the performance of Cori KNL versus Edison — these are the gray bars — and then do the same for Cori Haswell.
What you see doesn't look very good. The median — median, not average — speedup is 15% with respect to Edison if you just do this, and the median is even slower than Haswell, because Haswell has a very strong processor.
So what you see here is a median speedup, and some of these applications perform very, very badly on KNL — you get something like 30% of what you got on Edison. This is a cross-section over different applications.
There is a particle-in-cell code and all sorts of other kinds of applications, and I think there is a high probability that if you have an application, it fits somewhere in this picture. What you can also see is that some applications benefit heavily; those are mostly applications which are bandwidth-bound — I will talk about this later.
So the point is that you should optimize your code. Why should you? Easy: you get more for your buck — you can make efficient use of Cori KNL. And I can tell you, if you have a code which is not really optimized at all, fast success is possible, because when you just look into such a code there is a lot of low-hanging fruit you can simply pick, and you might get a good speedup already.
On the other hand, it is not pointless or worthless to optimize for KNL, because these many-core architectures are energy-efficient and they will stay around. Probably in future procurements — not only at our center, but looking at the whole HPC landscape nowadays — many-core architectures are the future. So I can tell you: if you need to optimize your code, do it now.
You will definitely get a benefit for most of it, and not only on KNL. The downside, of course — and I am totally aware of the effort — is that the most beneficial optimizations are really hard-won. You have to restructure code, probably restructure data structures, and some codes are really problematic because they are not easy to make thread-safe in any way without major changes.
So this can happen, but don't be too afraid to at least think about it. And of course the investing-in-the-future point also has a downside: am I betting on the wrong horse — what if Intel in two years leaves this HPC segment altogether? I can assure you there will be other vendors which will step in with similar approaches, so don't hesitate to consider what I'm going to tell you. So, when you optimize, what can you do?
There are certain things to consider: single-node, multi-node, and maybe I/O if you have an I/O-heavy application. You should always start with single-node performance, and for a lot of reasons. First, it's the easiest target: you have fast turnaround times for debugging and profiling, because you only need one node. It's also important because even if you have a multi-node application and you want to run it at a large scale:
you are bounded by the local execution speed. Even if you can hide the communication perfectly behind your computation, you will never be faster than a single node. So you should definitely look at the single-node speed first, and there are many profiling tools available for that; we already discussed a couple, and I will give a more comprehensive list later.
When it comes to multi-node performance, there are fewer optimization opportunities, and the profiling is a bit tedious because you have to deal with all these different processes; debugging is tricky too. When you do these optimizations you might find yourself in a debugging situation, but we offer tools for all of that, so you should not hesitate to try out the things I will mention later. For I/O performance there are not many things you can do at the moment, but I can offer a couple of suggestions. So, for the single-node case: what do I do?
I have an application and I want to know what the problem is — why is it, for example, 2x slower on KNL than on Edison? The first thing, very important: get to know your application. I don't know who is on the line — there might be some very specialized domain scientists — but in general, do not assume you already know your application. I come from a lattice QCD background, and even there, sometimes you ask what the problem actually is, and then they tell you:
"I don't know, maybe it's in your linear algebra routine" — but when you actually look at it, that's not the case. So it's important to determine the hot spots. You can do that, for example, with manual timing routines: take the subroutines or loops or whatever you think takes a lot of time, and time it. With manual timers, be very careful about thread safety, and put in synchronization barriers for MPI; but in general this is a very simple approach and easily portable.
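For illustration, a minimal sketch of such a manual timer in C — compute_kernel is a hypothetical stand-in for the suspected hot spot:

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

void compute_kernel(void);   /* placeholder for the hot spot */

void time_kernel(int rank)
{
    MPI_Barrier(MPI_COMM_WORLD);   /* all ranks enter together     */
    double t0 = omp_get_wtime();   /* thread-safe wall-clock timer */
    compute_kernel();
    MPI_Barrier(MPI_COMM_WORLD);   /* all ranks have finished      */
    if (rank == 0)
        printf("kernel time: %.3f s\n", omp_get_wtime() - t0);
}
```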
You can also use profiling tools which do the job for you, and one I like very much is CrayPat. It's basically loading a module; if you then use the Cray compiler wrappers for compiling, it technically just annotates the code, and in the end you get timing information for all the major sections, in addition to some MPI message statistics.
I would suggest starting with CrayPat — just time it and see what really consumes a lot of time. There is also something in between, which is Allinea MAP; it is very nice and comparably lightweight, because it takes more of a sampling approach. VTune is really heavyweight, and its data collection for big kernels can take a very long time. So this is a nice opportunity to try these out. Now assume you timed your application and found your hotspots — what do you do?
Which features shall I target? You have an application with a hotspot which takes, say, 90% of the runtime, and then: shall I use many threads, or try to vectorize that thing, or go for the more complex intrinsics, or do I just move the data to MCDRAM, or use it more efficiently? What shall I do? And the answer is: understand these hotspots.
Okay, so I found them; now understand them. Then for each of these cases you have certain options. For example, if you are compute-bound — I will tell you later what that means — then you use all the threads, you vectorize as much as you can, and you use AVX-512 or the more complex intrinsics.
When you are memory-bandwidth-bound, you have to exploit the memory hierarchy more, and you definitely also want to thread, because more threads can saturate the bandwidth better — a single thread usually cannot saturate the bandwidth fully. And if you are latency-bound, which is the more complicated case, then what usually helps is, for example, again more threads and vectorization. So I put a "yes" everywhere.
I mean: more threads, always. You do not want fewer threads; you will basically always try more processes or more threads — that's my take on it. So this is one solution, and I will show it: try to identify and exploit all the parallelism you can find, because you have a lot of it available and you want to make use of it.
One caveat, in the sense that it might not make sense to utilize all the hyperthreads — that's true. The hardware likes at least one thread per core: if, for example, you use 64 cores, you should try to utilize all of them, so that cores don't just sit idle. Hyperthreads are a different story, because they share execution units and so on, so that's a trickier question; but if you are compute-bound, for example, you can try using them.
As was said already: with the Cray wrappers you can technically just swap the Haswell target module for the KNL one and then just compile, and it will be fine. For Intel you can do the same, or add the optimization flag manually (-xMIC-AVX512) so that it optimizes for the AVX-512 architecture; for GNU you just pass -march=knl and use proper OpenMP settings. This is just an example.
And if you're not feeling very experimental: just try KNL in quad,cache mode and you should be fine for many, many applications, and you don't have to worry about MCDRAM usage for the moment, because it is handled in the background for you.
So what else do you need? You need to understand the number of flops — the number of floating-point operations you execute. Not per second: the total number of floating-point operations.
You go to a kernel and count these things. You can do it manually, which is tedious, but it's nice for understanding the order of magnitude: for each floating-point addition or multiplication you count plus one, and for each complex multiplication you count six, because it has four multiplications and two additions. Then for loops you multiply by the trip count, and so on. That you can do, and then you get an approximate flop count for the kernel.
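As a tiny worked example of that counting recipe (illustrative code, not from the talk):

```c
/* One multiply plus one add per iteration = 2 flops, so this
   loop executes 2*n flops in total. A complex multiply
   (a+bi)(c+di) = (ac-bd) + (ad+bc)i would cost
   4 multiplies + 2 adds = 6 flops, as described above.       */
double dot(const double *a, const double *b, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];   /* 1 mul + 1 add per iteration */
    return s;               /* total: 2*n flops            */
}
```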
Alternatively, we have documentation on the website for how to do that [with Intel SDE]; you can technically just copy it. There are start and stop regions which basically say: start the flop collection when the code execution passes this point, and stop it here. Then you get the flops executed in that kernel, and it also accounts for masking — KNL supports vector masking and things like that, and SDE will account for these — so you will get a relatively precise flop count from it. It's pretty much a no-brainer.
no-brainer.
It's
just
that
the
runtime
of
this
thing
can
be
very
large
if
your,
if
your
profile
a
long
section,
so
you
should
technically
try
to
try
to
downscale
the
section
and
then
try
to
estimate
for
bigger
problems
or
for
more
realistic
problems.
What
this
flop
count
will
be.
The next thing you need is the bytes: the number of bytes transferred from main memory for this kernel — from main memory, not from cache. Can you compute this manually? Not really, and we don't recommend it, because you do not account for data reuse through caching. So what can happen?
For example, if you estimate your byte usage by hand and assume that everything you need is read only once and written only once, that is usually a bad approximation, because you can have cache evictions and all these kinds of things, and then you have to read the data multiple times from main memory. So it's good to have an order-of-magnitude estimate here, but I definitely recommend measuring it with VTune.
For example, you can annotate your code — this is a C example; there is also a Fortran equivalent. You basically link the ITT notify library, and then before your kernel you insert an __itt_resume and after it an __itt_pause. SDE works with the same idea.
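A minimal sketch of that C annotation, assuming you link against VTune's libittnotify and start the collection paused (hot_kernel is a placeholder):

```c
#include <ittnotify.h>   /* ships with VTune; link -littnotify */

void hot_kernel(void);   /* placeholder for the kernel of interest */

int main(void)
{
    /* ... setup you do not want to measure ... */
    __itt_resume();      /* start collection here    */
    hot_kernel();
    __itt_pause();       /* stop collection after it */
    return 0;
}
```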
So now you have the time from the first run, when you looked at the timings for all your kernels — you have that number — and then you plot this on the architectural roofline. The architectural roofline is given by the minimum of the memory bandwidth from main memory times your arithmetic intensity, and the peak flops.
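Written out, with AI denoting the arithmetic intensity (total flops divided by bytes moved from main memory), the attainable performance is:

```latex
P_{\mathrm{attainable}} \;=\; \min\bigl(\mathrm{AI} \times BW_{\mathrm{mem}},\; P_{\mathrm{peak}}\bigr)
```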
So what does that look like? This is the picture for KNL: this is basically the roofline for DDR.
This is the roofline for high-bandwidth memory, the MCDRAM; this is, for example, the flops you can reach when you do not vectorize your code; and this is the maximum peak flops you can get. What happens when you compute this arithmetic intensity and the performance for one of your kernels? You will end up, for example, here.
And this, for example, tells you that your kernel has very low arithmetic intensity, and if you hang at this roofline you are memory-bandwidth-bound; probably you ran the code out of DDR, because you hang at the DDR roofline. In that case, what do you want to do? You want to use MCDRAM to make your code faster — note the logarithmic scale, so this alone helps a lot here. What you can also find is a kernel which hangs around here, at the vectorization roofline.
That means it really helps to try to vectorize your code properly, so that you can break through that and get better performance. When you do that, you probably end up at the instruction-level-parallelism roofline: that is when your code does not use fused multiply-adds. Matrix multiplications have a lot of FMAs and can make heavy use of them, and so can some Fourier transforms, but not every kernel can. Only if you exploit these
do you get another factor of two or so, up to the peak roofline. And then there is something in between, and unfortunately many kernels are like this: somehow not really compute-bound, not really memory-bandwidth-bound — maybe latency-bound, something like that. What you can try is to improve your threading and your vectorization, and you might move up a little bit, but don't expect too much; it's a tricky problem.
You really have to do an in-depth investigation to follow what is really going on. Okay, assume you did all that, but then you think: am I happy with this guy here? Probably not, right? Because it sits at the roofline, you cannot do better by simple changes, but your performance might still be rather bad. The only way to improve that thing now is to improve the arithmetic intensity, because if you move in this direction, you still have room left.
For these things you need to look harder and work harder. So what can you do? Remember, the arithmetic intensity is flops over bytes. To improve it — meaning to increase it — you can, for example, really increase the number of flops and leave the number of bytes the same. That sounds easy, but it's actually not, because the flops are determined by the problem you tackle and by the algorithm; it's not easy to change the number of flops.
Of course you can put in meaningless multiplications by one or additions of zero, but that is not the point here — we want to improve the execution speed — so that is usually not a viable way. What you can do is try to keep the number of flops constant but reduce the number of bytes read from main memory. In reality there is a trade-off between these two things, but the second point is the way to go. So here, in more detail, is what you could do.
First: of course, use a lot of threads, ensure vectorization, and, if you use OpenMP, try to reduce overhead. What I have seen a couple of times are codes which are written in a linear-algebra sense. Maybe this is a very naive
version of it, I would say — sorry, no offense — but it's like: you multiply a vector with a matrix and store the vector; then you subtract a vector from another vector; and then you compute the norm of it. This is more like how you think about it in mathematics, but it is not how you should write it.
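A sketch of what I mean, in C (a hypothetical example): instead of computing y = A·x, then r = y − b, then ||r||² as three separate passes over memory, you fuse everything into one pass so the intermediates stay in registers:

```c
#include <stddef.h>

/* Fused version: one pass over A, x, b; the intermediate
   y_i and r_i live in registers instead of arrays.        */
double resid_norm2(const double *A, const double *x,
                   const double *b, size_t n)
{
    double nrm = 0.0;
    for (size_t i = 0; i < n; i++) {
        double yi = 0.0;
        for (size_t j = 0; j < n; j++)
            yi += A[i*n + j] * x[j];   /* y = A*x            */
        double ri = yi - b[i];         /* r = y - b          */
        nrm += ri * ri;                /* ||r||^2, all fused */
    }
    return nrm;   /* caller takes the square root if needed */
}
```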
In physics problems you often have, for example, a loop over the three dimensions. If you use OpenMP, there is a nice statement called collapse, which basically just flattens out the whole loop nest and distributes the threads over the whole fused, collapsed loop. That is quite good, because if you do not put this in, it would only distribute the threads over the outermost loop.
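A minimal sketch of the collapse clause in C (hypothetical update over a 2-D grid with 3 components):

```c
void update(double *u, const double *f, double dt, int ni, int nj)
{
    /* collapse(3) flattens all ni*nj*3 iterations into one
       iteration space; without it, only the ni outermost
       iterations would be distributed among the threads.   */
    #pragma omp parallel for collapse(3)
    for (int i = 0; i < ni; i++)
        for (int j = 0; j < nj; j++)
            for (int d = 0; d < 3; d++)
                u[(i*nj + j)*3 + d] += dt * f[(i*nj + j)*3 + d];
}
```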
So that is something you should think about. Another thing: you probably need to rearrange data structures, and you can try to move the parallelism outward — going from a fine-grained parallelization model like this one to a more coarse-grained parallelization model. That is not always easy to do and may require, for example, storing intermediate results in arrays and things like that. There is a certain trade-off, but usually you can gain a lot if you do that, to a certain extent, I think.
We also have some case studies on the web pages which describe this. The next thing is loop tiling. Loop tiling improves cache reuse and therefore potentially reduces the number of bytes read from main memory, because data are then no longer read from main memory but from cache — and cache has a much, much higher bandwidth, so those reads don't really hurt you. This, for example, is a very, very simplified kernel from the Quantum ESPRESSO materials science code.
A
It's
technically
we
have
a
matrix
and
multiply
the
rows
of
the
matrix
with
this
P
vector
and
store
it
in
another
matrix.
So
the
problem
here
is
that
I
are
or
like
NIR
it's
long.
So
this
is
very
long
loop
and
it
is
also
very
long
loop
and
what
will
happen
is
when
you
go
through
the
IRS.
It
will.
Basically
you
see
that
that
B
is
not
dependent
on
J.
What happens is that it basically streams through the a's and b's and then starts again for the next j iteration — but by then the b's from earlier in the previous pass have already been evicted from cache, and you need to load them again. So the trick here — and this is always a good thing to do — is to block, or tile, the loops. It makes the code look more complicated, but it can really significantly improve performance.
What we did here was define a block size — this might be architecture-dependent, or you might use a block size which works well on all the architectures you are targeting — and do some index calculation; don't worry about that part. The idea is that you now have an iteration over chunks of the inner loop: one loop iterates over blocks, and in here you iterate within a given block. By doing that you can keep this block of b in memory for all the j's, for example.
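A hypothetical sketch of the blocking idea (not the actual Quantum ESPRESSO source — the names and block size are my assumptions): the i range is tiled so one block of a and p stays in L2 across all j iterations.

```c
#include <stddef.h>

enum { BLOCK = 4096 };   /* tune so the block fits the ~512 KB of L2 */

void scale_rows(double *c, const double *a, const double *p,
                int nj, int nir)
{
    for (int ib = 0; ib < nir; ib += BLOCK) {
        int iend = ib + BLOCK < nir ? ib + BLOCK : nir;
        for (int j = 0; j < nj; j++)
            for (int i = ib; i < iend; i++)   /* reuses p[ib..iend) */
                c[(size_t)j*nir + i] = a[(size_t)j*nir + i] * p[i];
    }
}
```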
This is very important, especially because on KNL we don't have the L3 cache. For this kind of code, like the first loop I showed you, the big L3 cache usually helped: if you couldn't find the data in L2, you went to L3 and probably found it there. But on KNL you don't have it, so every L2 miss goes to DDR or to MCDRAM.
You go to main memory, and you don't want that. When you block these kinds of things, you can try to block for the shared L2. The L2 is shared between the two cores on a tile, and shared means each core's fraction of it is around 500 kilobytes. So if you try to block your loop content to these 500 kilobytes, you usually get a good L2 hit rate. And in order to see whether these kinds of transformations are successful, and to tune them, you can use VTune to check, for example, the L1 and L2 miss rates — just look at how big the miss rates are and then adjust the block size
accordingly. The other thing you can try is short-loop unrolling, and this is basically just helping the compiler to vectorize what you want it to vectorize. For example, you have this nicely collapsed loop with a collapse(3) statement, and in there you have some norm calculation over, say, a three-component vector.
If you leave it like this, what can happen is that the compiler sees: okay, nice, this thing is collapsed — oh wait, there is another loop in here, let's vectorize it. So the auto-vectorizer may try to vectorize that inner thing, but it has a trip count of three, so you waste a lot of vector lanes. You don't want to vectorize there;
A
You
want
a
vector
s
here
right
because
using
these,
these
indices
might
be
big,
and
if
you
don't
put
the
Cindy
statement
here,
the
compiler
will
probably
take
this
loop
too
for
vectorization
as
a
target,
and
that
is
not
great.
So
what
you
need
to
do
is
to
unroll
this
loop,
and
this
is
only
true
kind
of
free
right.
So it's not a big deal: you just unroll it, and then you're done. Especially with loops whose trip counts are not an integer multiple of 2 or 4 or so, the compiler sometimes does partial unrolling and these kinds of things, and usually spends a lot of cycles on stitching these loops back together.
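A hypothetical sketch of the idea in C: vectorize the big collapsed loop and hand-unroll the trip-count-3 inner loop, so the compiler does not pick the short loop as its vectorization target.

```c
#include <stddef.h>

void norms(double *nrm, const double *vec, int ni, int nj)
{
    #pragma omp parallel for simd collapse(2)
    for (int i = 0; i < ni; i++)
        for (int j = 0; j < nj; j++) {
            const double *v = &vec[3 * ((size_t)i*nj + j)];
            /* unrolled: for (d = 0; d < 3; d++) s += v[d]*v[d]; */
            nrm[(size_t)i*nj + j] = v[0]*v[0] + v[1]*v[1] + v[2]*v[2];
        }
}
```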
Sometimes — look especially at the very bottom of the report — when the compiler goes through the code, it says: yeah, this loop was vectorized, that loop was vectorized; and at the very end it says: oh wait, I did not vectorize these loops because it would probably be inefficient, something like that. So definitely check the compiler output — whether it vectorized the loops you want it to vectorize — or use Intel Advisor, which is basically a very fancy way of parsing these reports.
One thing about vectorization: data alignment. You should align and pad data. In Fortran, gfortran does it automatically for you; if you use the Intel compiler, you have to tell it to do so. So if you compile Fortran code with Intel, try to put in -align array64byte, so that all the arrays are nicely aligned. In C and C++ it's unfortunately not so easy. What you have there is aligned_alloc, or the GNU __attribute__((aligned(64))), and for Intel there is __declspec(align(64)); these go directly on the declaration statements. For heap memory, in C++ you can play a trick: just overload the new operator with an aligned allocation, and you're done. In plain C you don't have that option, so you can write an allocation wrapper function and use it, but then you technically have to go through the whole code.
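A small sketch of both variants in C (the 64-byte figure matches the AVX-512 cache-line-friendly alignment discussed here):

```c
#include <stdlib.h>

/* Static data: GNU attribute syntax; the Intel compiler also
   accepts __declspec(align(64)) on declarations.              */
static double buf[1024] __attribute__((aligned(64)));

/* Heap data: C11 aligned_alloc. The size must be a multiple
   of the alignment, so round it up to the next 64 bytes.      */
double *alloc_aligned(size_t n)
{
    size_t bytes = (n * sizeof(double) + 63) & ~(size_t)63;
    return aligned_alloc(64, bytes);
}
```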
Here is a kernel — I tweaked this version already, but it's one you could technically find in the wild. It looks like a smoothing kernel from a multigrid code, and it's even-odd preconditioned: it iterates over the even sites and then over the odd sites. This is an inefficient version: we have an if condition inside this vectorized loop. The KNL can mask that out — it can detect this.
Basically it can take only the even elements, gather and pack them together to fill the vector unit, execute that, and then basically scatter and inject the results back. But avoiding the conditional is much more efficient: in this case the app runs in 0.8 seconds, so you get a 1.5x speedup just from this tiny, tiny thing. So watch out for these kinds of things when you vectorize, and really try to play around and shuffle these conditionals — first, try to get rid of them completely.
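A hypothetical even/odd sketch with a toy 1-D stencil — not the multigrid kernel from the slide — showing the restructuring:

```c
/* Branching on parity inside the loop leaves half of the
   SIMD lanes masked off and doing no useful work.         */
void smooth_masked(double *out, const double *in, int n, int parity)
{
    for (int i = 1; i < n - 1; i++)
        if ((i & 1) == parity)
            out[i] = 0.5 * (in[i-1] + in[i+1]);
}

/* Restructured as a stride-2 loop: no conditional, and
   every vector lane does useful work.                     */
void smooth_strided(double *out, const double *in, int n, int parity)
{
    for (int i = 2 - parity; i < n - 1; i += 2)
        out[i] = 0.5 * (in[i-1] + in[i+1]);
}
```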
But if you can't, try to do these kinds of transformations to basically reduce them. There is also something which is called reduced-precision math. If you use a lot of transcendental functions — square roots, exponentials, whatever — they are expensive, and the KNL ISA has so-called reduced-precision variants of them. You can enable them by specifying -fp-model fast=2 together with the no-precise-divide/sqrt options during compilation, and this can help you — though don't expect too much from it.
But if you are using these kinds of functions very heavily, it might really help you. Another funny thing we found: if you have something like a division by a constant in a loop — don't do that; just define the inverse of it once and multiply by it. It's funny that the compiler does not necessarily pick that up, especially if you have a bigger code. This sometimes gives you something like 10% for free.
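A tiny illustration of that reciprocal hoisting in C:

```c
/* Division in the loop: one expensive divide per element. */
void scale_div(double *y, const double *x, double c, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = x[i] / c;
}

/* Hoisted reciprocal: one divide, then cheap multiplies.
   (Not bit-identical, which is why compilers won't do this
   for you without fast-math style flags.)                  */
void scale_mul(double *y, const double *x, double c, int n)
{
    const double cinv = 1.0 / c;
    for (int i = 0; i < n; i++)
        y[i] = x[i] * cinv;
}
```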
This plot shows the speedup when you compile the code for AVX-512 versus compiling the same code for AVX2, for the optimized codes — so they already vectorize nicely — and the median speedup we get is about 1.2, so roughly 20%. But the benefits can be much larger, for example here, where it can be a multiple.
You can technically even reduce memory latency by using it. Of course, AVX-512 is automatically enabled when compiling for KNL, so you don't have to do anything; here we had to manually disable it in order to test this. Next: use MCDRAM — I talked about this already, and I can recommend to just always use it; don't even think about not using it.
This is the same cross-section of NESAP codes, and these are the speedups you get. The gray bars are the MCDRAM speedup: what you get when you run your code fully from MCDRAM versus running it from DDR. The solid gray might be either flat or cache mode — whatever the best configuration was for that code — and the red additionally compares the benefit of going to flat mode instead of cache mode. What you see is that MCDRAM always helps; there is no case where it really hurts you.
For some codes there is no big speedup, but technically it won't hurt you, so always use it. Flat versus cache is a bit more of a mixed bag, because if you want to use flat mode, you either have to manually place the arrays where you want them — and then you have to think about which arrays to place —
or use numactl, and I cannot recommend using numactl in preferred mode, because of what will usually happen when you spill over: in many codes, at the beginning you initialize a lot of different arrays, maybe to set the whole thing up, but the really hot arrays you work with usually come later. So there is a good probability that those will be allocated in DDR, and you don't want that.
So in my opinion, in this case: just use cache mode, or, if you fit into the 16 gigabytes, use numactl with membind — then it will error out if you run out of memory, so at least you know what happened. This is mostly, I think, why for these codes the flat performance is worse: they used numactl in preferred mode.
One note on heap allocation: on KNL, memory allocation can be comparably slow. It's not super bad if you have a normal code, but some codes have kernels where they allocate and deallocate a lot of arrays, and that is very bad practice. So if you have something like an iterative solver and in every iteration step you do a lot of memory allocations: don't do that — move them out, as far out as possible, so that you technically allocate your whole stuff just once.
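A minimal sketch of that hoisting in C (do_step is a hypothetical solver step):

```c
#include <stdlib.h>

void do_step(double *x, double *tmp, size_t n);  /* placeholder */

void solve(double *x, size_t n, int niter)
{
    /* Bad: malloc/free inside every iteration.
       Better: allocate the scratch space once and reuse it. */
    double *tmp = malloc(n * sizeof *tmp);
    for (int it = 0; it < niter; it++)
        do_step(x, tmp, n);
    free(tmp);
}
```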
If allocating everything once would involve restructuring a huge code framework, you can think about pool-allocator libraries. What they technically do — for example Intel TBB — is overload your malloc or whatever, and basically, instead of really allocating memory from the operating system, they ask the allocator to hand out memory which was pre-allocated. That's much, much faster, and the good thing is you do not need to change the code, except for linking properly.
The bad thing is that you need to know the memory footprint up front: you allocate a pool, and if you run over it, mostly it will just crash with out-of-memory; and the code is less portable. That's the drawback. So the best thing is: just don't do a lot of allocations in the hot path. Now, for multi-node: there are not so many things you can do, but one important thing to consider is that a single KNL thread cannot saturate the Aries injection rate.
So you do not get the full bandwidth from one rank. This is a plot of multi-rank bandwidth — I think it's just a ping-pong benchmark, so you have two nodes sending messages to each other — with one rank per node, two ranks per node, four, and so on, and this is the bandwidth you measure depending on the message size. This cusp is technically a protocol change.
What you can see is that the curves flatten out, but also that one rank per node is pretty bad in terms of bandwidth. So what you want to do is run more than one MPI rank per node if you have this kind of pattern, because it can get you a lot of benefit in this region.
Of course we encourage users to use a mixed model between OpenMP and MPI, and this plot looks like we encourage the opposite, because with 64 ranks per node you only get the maximum benefit for big messages. But in that case you can, for example, think about using threaded communication in MPI — and I think with future MPI versions, where threaded communication is addressed even more, it will get even better — and that would definitely be a good opportunity to really saturate the bandwidth.
So I can recommend more than four ranks per node — let's say four ranks, eight ranks, sixteen ranks per node — and, as Helen said, dedicate cores to the operating system. If you, for example, use four or eight ranks per node, you cannot divide the 68 cores up nicely, so arrange the binding such that you do not split tiles.
That means you can just assume you have 64 cores and dedicate, for example, the remaining cores to the operating system. That's usually a good choice, because then you get some noise mitigation. Huge pages — this was said already.
What huge pages technically do, I think, is reduce translation-lookaside-buffer misses in the Aries when it translates addresses. How to use them: you basically load one of the craype-hugepages modules at compile time — just one, it doesn't matter which; you just need to have one loaded — and at runtime you load the huge-pages module of the size you want. So you can, at compile time, load the 2-megabyte one, and at runtime use any other. That's nice. And then there are also some MPICH environment variables
you can try out. For example, if you have codes which are very collectives-heavy — non-blocking or blocking, it doesn't matter — try to use DMAPP: that is -ldmapp for dynamic linking (for static linking it's a bit more elaborate, but I can hand out the details), and then try to export these variables here.
What they technically do is activate remote memory access over the DMAPP library, and they tell MPI to basically hand the collectives over to the DMAPP library for execution, which then uses RMA. And for certain collective operations, I think, this is somehow even done in the network hardware.
These kinds of settings might, in certain cases, give you up to a 20% speedup — at least that's what I've seen. If you do MPI-3 single-sided remote-memory atomics, for example, you can also try to set this one; it is especially good for small messages. So if you do an accumulate, get-accumulate, or whatever operation on an integer or a single float or double, you can set this, and what it will do
is, I think, use the hardware to do the locking. This can give you up to a 20x speedup, but only for smaller messages — the latency might be 20x better — so if you have bigger messages, it might actually hurt you. Okay, the last thing: some notes on I/O.
This is the write bandwidth in megabytes per second on Haswell and KNL — please ignore the stream and direct-I/O entries; this is for buffered I/O — and what you see is that KNL is 2x slower for a single core. That doesn't sound great for people who do a lot of I/O, but there is a solution, as for everything on KNL: just go multi-core. This is the bandwidth you get with multiple nodes and different numbers of cores per node.
I think one column is write and one is read — it doesn't actually say; I think read is on the right here. What it tells you is that at a single core KNL looks very bad, but if you go to multiple cores, you can even easily outperform Haswell. I don't recommend going to 64-way threading, but technically, if you just use 8 or 16, you're always better off.
Okay, there is one problem — I talked to our I/O people, like Jialin and Glenn: there are no good threaded-I/O solutions available, so either you implement your own (I don't know if there are some issues with POSIX there), or you use multiple processes, and then you can use, for example, MPI-I/O. Also, do not do the following: if you want to write out data, gather everything on rank 0 and have rank 0 write it out — don't do that.
Please: try to use MPI-I/O, or for example HDF5, which uses MPI-I/O under the hood, for parallel I/O.
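As a minimal sketch of that pattern in C (hypothetical file layout: each rank owns one contiguous block of a shared file, written collectively instead of funneling everything through rank 0):

```c
#include <mpi.h>

void write_blocks(const char *fname, const double *buf,
                  int nlocal, int rank)
{
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, fname,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    /* rank r writes bytes [r*nlocal, (r+1)*nlocal) of doubles */
    MPI_Offset off = (MPI_Offset)rank * nlocal * sizeof(double);
    MPI_File_write_at_all(fh, off, buf, nlocal, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}
```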
That's the best take-home message here: try to parallelize your I/O if you really have to write a lot of data. And then — which is always a good thing to do, especially for Lustre — write big chunks; pool your I/O.
Don't write small chunks every once in a while; try to pool them and then write one big chunk, because every time you write small data you might ping the Lustre metadata servers, which really slows you down. And reduce the file operations, the opens and the closes: don't open and close files all the time; try to reduce this. And of course, try to reduce the total number of files, ideally down to one.
It's hard to tell in advance when it will help you, but for large files — say 10 gigabytes and higher — try the Burst Buffer. That is a completely different world and might be a different tutorial, but as was said, you should try it out. We have a link: if you click on this text box, it will bring you to the Burst Buffer page, and you can just try it. It might help you.
[In response to a question:] Yeah, I don't know — I got the data from here, I believe, and I think Brian checked it and said that it looks okay. So, coming back: I talked about all these optimizations, and now you think, okay, does that stuff help? Well, we applied these optimizations to these codes.
This was the before-optimization picture. We applied selections of these optimizations to these codes — and maybe a little bit more; more about this you can find in the case studies, and I will post the link later.
So this is how it looked before — a modest median speedup versus Edison, and slower on Haswell — and now, after that process, we get a median speedup of 1.8x, comparing the optimized code on KNL versus the optimized code on Edison; and versus Haswell it's about even.
So, when you optimize code: go for single-node performance first. Then definitely try loop fusion and tiling — that helps a lot — ensure good vectorization, and use MCDRAM all the time. For multi-node performance, please try out huge pages — very simple: just load a module and compile — and try the DMAPP stuff, which is also just setting a couple of environment variables. And for file-I/O performance: try to parallelize your I/O or pool it, and of course reduce file operations to a minimum.
We have a lot of training material for this. These are all hyperlinks: for running jobs, how to do the thread binding, for code profiling and tools — we have a lot of tools to offer — and how to measure the arithmetic intensity; we have a set of scripts which basically grab the output of VTune and SDE and just give you the numbers you are interested in,
so that you don't need to go through the GUI and look for them; and how to improve OpenMP scaling, vectorization, and MCDRAM usage. Also very important: please look at the case studies. You might have a code which is very similar to a code we already optimized — especially if your kernels are similar kinds of stuff — so definitely look into the case studies, and maybe into the literature; there are a lot of different things around which can really help you optimize. Yeah.