Description
From the Cori KNL Training held June 9, 2017. For slides please see http://www.nersc.gov/users/training/events/cori-knl-training-2/
A: Let's get started with the afternoon session. I'm Zhengji Zhao from the User Engagement Group at NERSC, and I'm going to give this short talk about how to use VTune at NERSC. VTune is an Intel profiling tool, and its focus is basically node-level performance, but it works with both serial and parallel codes, and it provides both a command-line interface and a GUI. Generically, on another cluster, you the user might like to use the GUI to run your performance analysis and then also display the results in the GUI.
A: But in our case we recommend using the command-line interface to collect the data and then displaying it later on a login node. The reason we do that is because, in our case, you may run large-scale jobs that use a lot of node hours, and that is not easily handled by an interactive run. Another, slightly more technical issue: after we switched to Slurm it became easier to run the GUI, but it has a bit of a history.
A: Back when we ran Torque/Moab, the GUI didn't even work on our compute nodes at that time. So anyway, this is our recommended way: you run the command line and collect the data, and then later display it on a login node. For those people connecting to NERSC from remote sites, we recommend using NX to speed up the X11 applications.
A: Actually, if you are very far from NERSC you may see a big delay in your graphical display, but using NX basically solves that problem. VTune is available on Cori as a module, like all our other software; our current default is a 2017 update. The reason I'm emphasizing the version here is that VTune keeps changing its interfaces, how it looks and how it displays things, and also the available analysis types can change.
A: OK, so in this talk my focus will be just providing you a step-by-step guide on how to run VTune on the Cori KNL nodes, and the rest of it I would consider your homework to do. Basically I'm just talking about how to get VTune to run on our systems, because we have NERSC-specific customizations, the custom steps you need to do to run this application.
A: So first you need to compile your code with the debug flag, -g, and another important thing is that you also need to link the code dynamically. In our case the default is static linking, so you need to use the compiler flag -dynamic together with our compiler wrappers to build your application. Once this is done you can run, but there are a couple more notes for this compilation stage.
A: You can see we recommend that the flag -debug inline-debug-info be used. The reason to add this is that during compilation some of the code gets inlined, and if you use this flag you will be able to get more information from the inlined code. Another thing I want to mention: many of the profiling and debugging tools (the debugging tools, rather than the profiling ones) usually ask you to turn off optimizations when you try to use them.
A: But with this one it's okay, you don't have to; you can just keep whatever optimization flags you use in your normal builds of the code. And then, although this is not required, the Intel compilers are recommended. It should work with other compilers, but because this is an Intel product I believe it is mostly tested with the Intel compiler, so this is just an extra recommendation to make your life easier; sometimes we run into things that are actually not well tested.
A: So this is recommended. The other thing I forgot to mention: a static build is not necessarily going to fail, it may work, but in many cases we have seen the code segfault when we built a static binary. So that's just a note. As for the way to compile the code: I just take a skeleton code here, a really small code. It's a hybrid MPI/OpenMP code, and you can compile it like this.
A: As Helen already showed, to compile for KNL you need to swap the default craype-haswell module for the craype-mic-knl module, and then, as mentioned, use -dynamic and -g, and I also used the -debug inline-debug-info option, and use this to build your code. Then, to run VTune, here is the NERSC customization: you need to use the sbatch directive --perf=vtune. This is a very important flag you have to use.
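The compile steps just described might look like the following sketch; the module and wrapper names are as used on Cori at the time, and jacobi.f90/jacobi.x are placeholder names for the skeleton code:

```shell
# Swap the target architecture module and build dynamically with debug
# and inline-debug info; keep your normal optimization flags on.
module swap craype-haswell craype-mic-knl
ftn -dynamic -g -debug inline-debug-info -qopenmp -O2 \
    -o jacobi.x jacobi.f90
```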
A: Under the hood, what the flag does is tell the batch system to prepare the nodes for you, so that those nodes can do whatever VTune tells them to do. Basically, VTune needs some kernel drivers to be loaded in order to be able to collect hardware-event-based profiling data, so that's the requirement. The reason we want to do this dynamically is that VTune touches very low-level stuff.
A: Now it's much more stable, but when we first used this on Cray systems we often saw that it killed nodes and was very fragile. So for some of the kernel modules it uses, we don't want them loaded by default on all the compute nodes; this is the way we manage that.
A: It works like this: when you start a job, those kernel modules get loaded, and when the job quits, those modules are removed. Another thing is that you need to load the vtune module before submitting the job. This is actually a new addition, because now we support multiple kernel drivers on the compute nodes; so before submitting your job you need to load the vtune module, so that the batch system can load the corresponding kernel drivers.
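Putting the pieces together, a minimal job script following these rules might look like this sketch; the node geometry, KNL mode, and time limit are placeholders:

```shell
module load vtune    # load BEFORE submitting, so the batch system
                     # knows which kernel drivers to load
sbatch <<'EOF'
#!/bin/bash
#SBATCH -N 1
#SBATCH -C knl,quad,cache
#SBATCH -t 30:00
#SBATCH --perf=vtune    # prepare the nodes: load the VTune drivers
# ... srun amplxe-cl commands go here ...
EOF
```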
A: So this was recently added. Another thing is that you have to use the Lustre file system. Well, I shouldn't say it is necessarily Lustre; a tmpfs is fine too, that's just in memory. The reason is that our global file systems are accessed from the compute nodes through Cray's so-called DVS layer, and DVS does not support all of the memory-map (mmap) functionality.
A: So some things just don't work, and that is why the Lustre file system is a requirement: you have to run VTune out of the Lustre file system. The newer versions of VTune actually report this: if you run on a global file system, like project or home, it will give you a nice informational message and ask you to switch file systems. That is much better; back in the earliest times it would just fail with some misleading messages, but now it's much better.
A: So here are the commands you can type: just go to a directory on the Lustre file system, load the module, and then, taking an example of running VTune interactively with salloc, after you get the compute nodes, inside the job on the compute node, these are the commands you need to type, say module load vtune.
A: This can probably be skipped because we already loaded it outside. Then, let's say, as in this example, we have an OpenMP/MPI hybrid code, so we need to handle the thread affinity: we set the affinity environment variable here and set how many OpenMP threads we want to use. Then the srun command line is the same as you would normally use to run the code, but something goes before your executable.
A: Before your executable on the srun command line you put the VTune command: amplxe-cl, and then -collect, and after -collect comes the analysis type. In this line, memory-access tells VTune to do the memory analysis experiment; -r tells it where to store the result; and -trace-mpi is needed for MPI codes.
A: With -trace-mpi, each rank will have one profiling data file to store its profiling data. Then here are a couple of tips. For data collection, VTune by default does the finalization automatically, which means once it collects the raw data it processes it before the job quits. So here we have an extra option, -finalization-mode=none, which means we don't want to do any post-processing after the raw data is collected.
A: The reason we do this is that the single-thread speed on KNL is much slower than on a conventional core; basically you can say it's two times slower compared to Haswell. So if you do the finalization on the compute node it will take a long time. Instead we defer it: we just collect the data and then do the finalization outside of the batch job.
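An interactive session assembled from the steps above might look like this sketch; the scratch path, executable name, and rank/thread counts are placeholders:

```shell
cd $SCRATCH/vtune_runs          # a Lustre file system, as required
module load vtune
salloc -N 1 -C knl,quad,cache --perf=vtune -t 30:00

# Inside the job, on the compute node:
export OMP_NUM_THREADS=8
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
srun -n 8 -c 32 --cpu-bind=cores \
  amplxe-cl -collect memory-access -finalization-mode=none \
            -trace-mpi -r vtune_results -- ./jacobi.x
```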
A: OK, so another option I want to mention is -data-limit=0. This one means we want to collect without a size limit, that is, just collect all the data. Otherwise the default is 500 megabytes, which means VTune stops collecting once the data reaches the 500-megabyte size.
A: For a real application code you will probably run into this limit very soon. Further on I will show you some results from a materials science code, VASP, and we saw that before any iteration even started it had already reached 500 megabytes, so you may need this option for your real runs. And here is another command I would like to mention, so you can see the help information. Let's say you want to know what types of analysis are available; then you can type amplxe-cl -help collect.
A: That will show all the options of this command-line interface and also the analysis types available. For each analysis type there are further options, something called knob options, and those fine-tune what kind of data you collect in your experiment. You can run amplxe-cl -help collect with an analysis type, something like memory-access.
A: Then you can see all the available knob options for the memory-access analysis type. And then, just to show you what kinds of analysis are available: you can see here we have this advanced-hotspots; you can see later what it looks like, but the names pretty much explain what they are. I think the most interesting ones are advanced-hotspots and general-exploration.
A: Also memory-access, and this hpc-performance one is a really good one. Actually, that one is the result of a NERSC request.
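The help commands being described can be typed as follows (amplxe-cl is the 2017-era command name; later VTune releases renamed the tool):

```shell
amplxe-cl -help collect                  # list the available analysis types
amplxe-cl -help collect memory-access    # list the knob options for one type
```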
A: They didn't have it in the earlier versions, but they added it per our request. And then there are knob options like this one for memory-access: the knob analyze-mem-objects. This one is actually critical, but unfortunately I forgot to add it in my experiment. What it does is map the memory operations back to the objects in your code.
A: So if some big array allocation is going on, VTune can map that to the specific object; but I didn't include it in my test, so I'm missing that data. Anyway, those are optional; you can add them per your need. If you want to run batch jobs, then this is the example: to run in cache mode you just request such a node, and this example provides the corresponding command line.
A: The amplxe-cl command line to collect the data is used in the same way. Once the data is collected, you can use the VTune GUI on a login node to display the result. To display it, you just run the amplxe-gui command after you load the vtune module; then from the main display you can see there is a link called Open Result, and you click it.
A: Then you find where your result file is, open it, and it will display the result. If the data is not yet finalized, then upon opening the file the GUI will finalize it first and then load the result; or you can do the finalization outside of the GUI, by just running the command with the -finalize option and giving it the finalization mode, and you can get the data finalized.
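On a login node, the finalize-and-view steps might look like this sketch (the result directory name is a placeholder):

```shell
module load vtune
amplxe-cl -finalize -r vtune_results    # finalize outside the GUI
amplxe-gui &                            # then use "Open Result" to load it
```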
A: OK, so next I will give some examples of what VTune collects and what the GUI interface looks like; I think this one is good for a little demo over here. I have an application called VASP; as I mentioned, it's a bigger code, and it actually uses the most computing cycles at NERSC.
A: When we rank the codes by how they use our computer time, this one is the top one. So now I'm going to show you the test we ran with it. Once you open the GUI, the interface looks like this, and then you click Open Result. I already opened it, so I wanted to skip that; or maybe, let's do it.
A: This is the memory-access analysis. Here it says how this view was collected, and it says you use it to identify the potential memory-access-related issues, and there is more; it tells you what this view provides and what you can get from this display. We can let it go, and then the first one you can see is the Summary view.
A: It summarizes all the things, something like here: Elapsed Time is the wall time used by the code. Actually, this is a very well optimized code, so you can see only one flag; if it were not very well optimized you would see multiple flags. Something like here: this is a red flag, we can see only one, and it says the code has high L2 misses.
A: The nice thing about VTune is that you can just move your cursor onto whatever shows up in the interface. Let's say L2 Miss Bound: this is the term they use, and they explain what it is. The L2 Miss Bound metric shows the ratio of cycles spent handling L2 misses to all cycles, so it defines what it displays, and then it shows how to improve it: it says "consider..." and so on, what it is and then the potential way to improve it. This is a very good part of VTune.
A: The other thing you can see from the Summary report is the bandwidth utilization report. This is a histogram: what it shows is the elapsed time on the vertical axis, and the horizontal axis is in gigabytes per second, that's the unit. This is the DRAM memory bandwidth utilization, so this graph says the code uses about one gigabyte per second of bandwidth for most of the execution time.
A: If you put the cursor here, you can see the bandwidth utilization is one gigabyte per second, and for about 92 seconds the utilization is at that one gigabyte per second. You can see even more: around here you can see three gigabytes per second, but there are only a few seconds where it uses that much bandwidth. And here you have options, because now we have a memory hierarchy with multiple levels; since I ran this in cache mode, you can see it also shows the bandwidth utilization for the MCDRAM.
A: There you can see that for maybe less than a second it reaches close to a large value, though still much lower than the theoretical peak, which is about 470 gigabytes per second. But at least we see that at a certain point it reaches a big bandwidth usage, then it goes lower and comes down; for the majority of the time it is not using much bandwidth. Then there is a Bottom-up view, which provides more detail about this test.
A: Here you can see the bandwidth for the DRAM as a time series: along the execution time, what the bandwidth utilization is. I believe this early part is the initial stage where the test is being set up, and from here, if you hover the cursor, it shows a tooltip like package_0: this is the DRAM total at this point.
A: It says the memory utilization there is less than one gigabyte per second. This is the bandwidth view; it also gives the read/write breakdown and the MCDRAM results, and things like that. The lower part here is CPU Time. If you don't know how they define the CPU Time metric, you can go back to the Summary.
A: There you can see that they define even terms that appear really obvious, but they do define what they are. Anyway, we go to this view, and the pink cells here are the places where VTune thinks the bottleneck is, so you can put the cursor on one and it explains what it is and how you can improve it.
A: So that was the memory-access view, and I can show you one more thing, the hpc-performance characterization view. From the summary you can see the CPU utilization is 17%, and it is flagged over here. The reason it gets a flag is that VASP cannot make use of hyper-threading; if you use hyper-threading it's much slower than it should be. We have in total 272 logical cores, that is, 68 cores times four hyper-threads each.
A: We have that many cores, but we use only 64, so VTune thinks the utilization is poor and gives you this flag. But there is nothing we can do about that; it's the algorithm, it's the nature of the code, it just doesn't work well with hyper-threading. Then we go down to the lower part, and it shows the memory bandwidth utilization and the memory and cache utilization over here. We see the same thing; this is the same system, but I ran multiple times with different analyses.
A: You can see a similar report over here: it shows the L2 miss bound (the previous one showed 24%, this one 23%, but anyway, it shows it), and you can also get the MCDRAM results. It also shows the SIMD instructions per cycle. The interesting measure here is something called packed SIMD instructions: using this measure you can see how well the code is SIMD-vectorized.
A: This code is actually pretty well vectorized; the developers did a lot of work on vectorizing this code, so we can see the result of it, and this is pretty high, I think. You can also do the Bottom-up view, and things like that. I am running out of time, but the last thing I want to mention is that VTune can sometimes appear a little involved.
A: If you don't want to pay that much of a learning curve, another easy one you can try is the so-called application performance snapshot tool. This is a recent product, just out of beta testing; it's freely available software provided by Intel, and it is very nice and easy to use, but provides basic information about how the code performs: information such as CPI, which means cycles per instruction. If this number is high, that means there are many stalls in the code execution.
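As a toy illustration of the CPI metric just described (the counter values here are made up for the example):

```shell
# CPI = cycles retired / instructions retired; a high value means stalls.
awk 'BEGIN { cycles = 4.0e9; instructions = 2.0e9;
             printf "CPI = %.1f\n", cycles / instructions }'
```

A CPI around 1 or below generally indicates a well-fed pipeline; here the made-up counters give a CPI of 2.0.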
A: So a high value is not a good one, and it gets flagged over here. It also analyzes the MPI time, whether OpenMP is imbalanced or not, back-end stalls, SIMD, memory footprint, and even I/O, I think. This is a really good high-level overview of the code performance. To use it, you just load the module, run the script, and then you get the report.
B: Hello, my name is Tuomas Koskela; I'm a postdoc here at NERSC, and I'm going to talk about another Intel performance tool, Intel Advisor, and in particular about the new roofline features that have come up in recent versions of Advisor. I just want to thank Zakhar Matveev from Intel, who is one of the main developers; he's working quite heavily with us to develop and test new features of Advisor, and he has provided me with a lot of the material that I'm going to show here. So I'll go in a bit of a reverse order.
B: I have examples, but at the end; I'm first going to talk more high-level about what you can do with Advisor. Advisor is basically a tool for the vectorization efficiency of your code, although it's kind of spreading into other areas as well. It has five main steps. The first thing, and that is probably what most people will be happy with, is that it provides you with compiler diagnostics and performance data from your application, by loop and by source-code line.
B: It gives you information about how well you're vectorizing, why it thinks you're not vectorizing, and why your vectorization efficiency might be poor. Then the second step is that it gives you some advice on how to fix the issues that it finds, based on what Intel thinks is a good way to vectorize code, which can sometimes be useful. It can tell you things like: you have a dependency here that's preventing your vectorization, and you can remove it easily by these transformations.
B: And that's basically the basic usage of Advisor. Then you can run additional diagnostics to collect trip counts and flops, and that basically gives you an idea of the absolute performance of your code on the system; that's tied to the roofline analysis that I'm going to talk about in just a moment. And then you can do even more detailed analyses, like dependency analysis or memory access pattern analysis.
B: So the basic workflow goes something like this. You start by compiling your code, and what Zhengji said about compiling for VTune is basically true for compiling for Advisor as well. I think you can run it with static linking; I've done it and it hasn't been giving me problems. But you compile with -g and with all your optimization flags on too; obviously you need those for vectorizing. So you compile your code and you run a survey.
B: You get some information from the survey; you might go back, change things in your code, run again, and work in this first loop for a while. Then, if you feel you might need more information, you can go into the deeper analyses: trip counts, dependencies, memory access patterns. But all of these will run on the same binary and on the same Advisor "project", as they like to call it, so you will just be adding more information to the same data set that you're collecting.
B: This is a snapshot of the first step, the summary. What you will see when you open up a result is a summary. There are tabs on the toolbar that contain different things, but the summary will give you an overview of the performance of your code. Some things are similar to VTune, like it will tell you your CPU time, but then you get metrics like how much of your code's execution time is spent in vectorized code.
B: Then it will select some loops up here that it sees are taking up most of your time and give you some additional information, and basically, in this interface, you can click on a loop and it will take you to the source code and give you more details line by line. This bottom part here comes after you've run the memory access pattern analysis.
B: I'll talk about that in a little while. Then what you can do from here is go to the survey report, and it will give you more details on a loop-by-loop basis. Advisor likes breaking things up into loops, and so it marks which of your loops are vectorized with an orange color and scalar ones with a blue color, and then for the vectorized loops it will tell you what the vectorization efficiency is, here on the left.
B: Then it will tell you what vector instructions you're using, how much it thinks you're gaining in performance from vectorization, and what your vector length is. And for the loops that it thinks are not vectorizing, or are vectorizing inefficiently, it will give you some explanation of why this is not vectorizing and maybe what you can do about it. So again, there are links here you can click, and they will take you to the more detailed advice.
B: The other part of this is that you actually have another row of tabs underneath here, so you can go to the source code of each one of these loops, and it will give you, line by line, some of these same metrics. I want to highlight here the Code Analytics tab, which I like a lot. It gives you some analysis of your code for the loop in a sort of compact way.
B: For example, it analyzes your instructions: what fractions of your instructions are spent accessing memory, or computing, or maybe a mix of those, or something else. And it gives you a nice summary of basically all the information that is here on the upper row, but in a sort of clearer fashion.
B: So from here you can mark the loops you want to do deeper analysis on and then run those analyses, for example the memory access pattern analysis. The basic survey has a pretty low overhead, so you shouldn't see more than, let's say, 50% overhead on the execution of your code. The memory access pattern and dependency analyses have huge overheads.
B: So you want to be a bit selective with what you run them on, but you can mark the loops that you think might be interesting for this analysis and analyze them. And here's an example of the memory access pattern analysis: it will go through your memory accesses and give you the ratio of unit-stride, fixed-stride, and irregular-stride accesses; and basically, for vectorization, you want unit stride as much as you can.
B: How much of that can you expect to achieve? On the plot here on the left I'm showing a cartoon of the performance bounds of some system that has two levels of cache hierarchy and DRAM, and is capable of doing scalar, FMA, and SIMD computations. Basically, these lines here give you performance bounds that are set by the system. The y-axis here is performance in gigaflops per second, and the x-axis is arithmetic intensity, which basically means how many flops you compute per byte.
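The bound those roofline lines express can be sketched numerically as the minimum of the compute roof and bandwidth times intensity; the peak and bandwidth figures below are illustrative placeholders, not measurements of any real system:

```shell
# Attainable GF/s = min(compute peak, memory bandwidth * arithmetic intensity)
roofline_bound() {  # args: intensity (flops/byte), peak GF/s, bandwidth GB/s
  awk -v ai="$1" -v peak="$2" -v bw="$3" \
      'BEGIN { p = ai * bw; print (p < peak ? p : peak) }'
}
roofline_bound 0.25 2000 400   # low intensity: bandwidth bound, prints 100
roofline_bound 10   2000 400   # high intensity: compute bound, prints 2000
```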
B: You might notice that there's this quite large region in between where it's a combination of the two, and you need some more detailed analysis to actually figure out what's going on; but that's where Advisor can help you. OK, so this is basically a bit more motivation. You essentially might want to think about using the roofline model if you have this kind of question: you have an application, you've managed to speed it up, and you want to know what is limiting you now.
B: You place it on this plot and you try to analyze what the limiting factor is. And like I said before, usually it's straightforward: in some special cases you might analyze your application, it will sit here on the DRAM memory bandwidth line, and you will say, oh okay, I'm bound by DRAM memory bandwidth, so I need to improve my arithmetic intensity to get more performance, or reuse my cache better to get better performance.
B: So what can Advisor do here? In order to place the application on the roofline plot, you need to know how many flops per second the application is computing and how much data it's moving from memory, and Advisor can do both of these for you and plot your application on the roofline. I have an example here from Advisor, and this is something you also find under the survey tab, in the 2017 and newer versions.
B: OK, so what Advisor is doing to compute this roofline: it needs to measure the time it takes to run your application, it needs to measure how many flops the application is computing, and it needs to measure how many bytes it's moving from memory. This is done in two steps: the survey collection of Advisor counts the time, and then you need to run a second collection, the trip counts collection, which counts the flops and the bytes.
B: The flop counting, if you run this on KNL, is mask-aware. On KNL, if your code is not using all the vector lanes, some of them will get masked out and you risk over-counting the flops. But what Advisor is actually doing, because there is no flop counter on the KNL, is instrumenting the code and counting the instructions, working out the flops from there, and it's also taking the masks into account.
B: OK, so you collect this information and you place your application on the roofline. What do you do with this information? You should see whether your application is compute bound or memory bound, and that should give you some idea of what kind of optimizations you can apply to that application.
B: Compute-bound applications generally benefit from parallelization and vectorization, and memory-bandwidth-bound applications from cache reuse and memory alignment, or from using the high-bandwidth memory on KNL. OK, so I'm going to go through now, step by step, how you run Advisor to get the survey and the roofline. The first thing you need to do is load the advisor module, and that should set the path to the binaries. Then, we recommend running Advisor on the command line and viewing the results in the GUI.
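A sketch of the survey plus trip-counts workflow on the command line; the project directory and executable names are placeholders, and the flags follow the 2017-era advixe-cl interface:

```shell
module load advisor
advixe-cl -collect survey -project-dir ./adv_proj -- ./jacobi.x
advixe-cl -collect tripcounts -flops-and-masks \
          -project-dir ./adv_proj -- ./jacobi.x
advixe-gui ./adv_proj &    # view the survey and roofline on a login node
```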
B: You can also pass the report-survey flag on the command line, and that will write out all the information in the survey into a CSV file that you can then open in whatever text editor or spreadsheet software you like. And that's the basics. If you want to run this on an MPI application using our Slurm scheduler, it's not that different.
B: The only thing you have to remember is that you have to call srun on a compute node, give your normal arguments to srun, then put Advisor as the executable and give your own executable as an argument to Advisor, like this example shows here. Otherwise it's pretty similar. You might also want to add the data-limit flag: if you set data-limit to zero, it basically means that there is no limit on the data size.
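Under Slurm, the MPI collection just described might be launched like this sketch (rank counts and names are placeholders):

```shell
srun -n 8 -c 32 advixe-cl -collect survey -data-limit=0 \
     -project-dir ./adv_proj -- ./jacobi.x
```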
B: OK, so that should be enough to get started with Advisor. I've collected some links here. The first one is our NERSC Advisor page, which is actually really good; there's a lot of good advice in there. Then I've got a couple of papers about the roofline, and then a lot of Intel and NERSC resources on using Advisor.
C: One thing you mentioned is that the profiler is not using performance counters, right? So if you have an application whose behavior is difficult to predict, it could be bound by how it uses the caches, but perhaps it's indexed in a way that's not obvious; then the tool will not be able to measure the intensity, right?
B: Oh, that's a good point. What the current versions of Advisor mean by arithmetic intensity is based on all the data that is moved from any level of the memory hierarchy into the processor; you could say it's an L1 arithmetic intensity, so it will collect everything. We're working with Intel very hard at the moment to get a version that would count the arithmetic intensity just for traffic out of DRAM, because that is more likely to be a bottleneck.
B: Well, I mean, the two have sort of different uses, and they actually complement each other. But this seems to be more difficult for them to implement, so I hope it will come in future releases. They have a beta version that's kind of working at the moment.