From YouTube: 7 Case Study 2: VASP (NERSC Cori KNL Training 6/2017)
Description
From the Cori KNL Training held June 9, 2017. For slides please see http://www.nersc.gov/users/training/events/cori-knl-training-2/
Okay, so this is supposedly a talk about how VASP got optimized and runs well on KNL, but I think most of the optimizations were done by the developers. Recently we had a paper that actually summarizes all of the relevant work, and there is another one as well. But I think, why don't I just present, as the NERSC participation, what we have achieved; I think that would be another case study as well. So what I want to present in this case study is the code optimization that was actually done by the developers; their effort started a long time ago, probably several years ago.
The major effort is that they added OpenMP directives to the pure MPI code. The currently official release of the code is MPI-only, but this hybrid code is under beta testing, and I think it will be released very soon.
So fortunately the important optimization work has been done. NERSC actually participated in some of it, like exploring the MCDRAM performance impact on the code. We helped the developers through the dungeon session with Intel, which happened in 2015, I think. Another thing I think we helped with is the beta testing, and also exploring the execution-space parameters to get the code to perform better. So this is the current performance it achieved.
The blue bars are the pure MPI code and the red bars are the hybrid MPI+OpenMP code, the optimized code. If we look at the second bar group, the one with the star, that is the best performance we get now; compared to the pure MPI code, it is about a two to three times speedup. So there is a big performance gain here, but that would not have been possible if we just depended on what was handed over from Intel.
Actually, when we started to look at the performance on Cori, we saw numbers about 30% to 200% lower than what Intel provided us. But then we did some performance tests, and I think over that time period the system also improved. In any case, after this execution-space exploration, I think we brought the performance on Cori KNL up.
I was debating between talking about the optimization and the performance. Anyway, I don't have time, so I will just present roughly at a high level. The code optimization is essentially adding OpenMP directives to the pure MPI code, and the major challenge there is how to leverage the three levels of parallelism inside the code.
Basically, MPI is the top level, the second level is multi-threading using OpenMP, and the next level is SIMD vectorization. That is the major challenge they have addressed. I want to skip the details today, but basically they put a lot of effort into SIMD-vectorizing the code, so the code is really well vectorized.
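To make that concrete, here is a minimal sketch, not from the slides, of how those three levels typically map onto a Cori KNL run; the binary name vasp_hybrid and the specific rank and thread counts are illustrative assumptions:

```bash
#!/bin/bash
# Level 1: MPI ranks      -> set with srun -n
# Level 2: OpenMP threads -> set with OMP_NUM_THREADS
# Level 3: SIMD           -> fixed at compile time (e.g. -xMIC-AVX512 on KNL)

export OMP_NUM_THREADS=8        # threads per MPI rank (level 2)
export OMP_PROC_BIND=spread     # spread threads over the rank's cores
export OMP_PLACES=threads

# 8 ranks per node (level 1); srun -c counts logical CPUs (4 per KNL core),
# so 8 cores per rank * 4 hardware threads = -c 32
srun -n 8 -c 32 --cpu-bind=cores ./vasp_hybrid
```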
The part I want to present is what we helped with as the NERSC effort.
Basically, we collected six different test cases and used them to test the performance. Our intention is to cover the typical workloads at NERSC and also to serve as benchmarks, so we tried to combine all the different considerations together.
The reason we do this is that KNL provides a lot more flexible execution options at run time, so we tried to explore all the possible combinations of the execution-space parameters. At the same time, since this is a new code, we are familiar with the pure MPI-only code but not with this hybrid code, so we first did the thread scaling, and those workloads are actually from the production workload.
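As an illustration of what such a parameter sweep can look like, here is a hypothetical sketch of the thread-scaling part, holding the cores used per node fixed while trading MPI ranks for OpenMP threads; the binary name and output naming are made up:

```bash
#!/bin/bash
# Thread-scaling sweep on one KNL node: 64 physical cores in use,
# ranks * threads held constant while the split varies.

CORES=64
for t in 1 2 4 8 16 32; do
  export OMP_NUM_THREADS=$t
  ranks=$(( CORES / t ))                 # MPI ranks per node at this point
  # -c is in logical CPUs: threads per rank * 4 hardware threads per core
  srun -N 1 -n $ranks -c $(( t * 4 )) --cpu-bind=cores \
       ./vasp_hybrid > scaling_${ranks}ranks_${t}threads.out
done
```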
So it is not addressing the hero runs, you know, the big runs; these are just medium-sized ones, and we tested up to 16 nodes. This is one of the test cases, a really typical DFT calculation case, and we look at the thread scaling and compare with the Haswell result.
A
So,
basically,
what
we
see
is
the
the
code,
the
hybrid
ACOTA
skills
up
to
four
to
eight
threads
in
most
of
the
runs
different
in
all
the
runs,
and
then
also
we
compare
with
the
Haswell
result
and
we
can
see
around
the
like.
For
knows,
they
have
a
similar
performance,
but
to
this
one
after
that,
then
we
see
has
well
is
better,
but
in
another
code
path.
This
is
a
relatively
more
memory
intensive
workload.
This is something called a hybrid functional calculation, and we see that as the node count increases, Haswell eventually catches up, but the performance difference is much larger here. Thread scaling is a good strategy here, up to eight threads. Actually, I didn't include this in the presentation, but in the paper we also looked into the same system.
We looked at what the scaling looks like if we use 32 threads per task and go even further out in node count, and it still scales. For adding OpenMP, our goal, and I mean for many codes the goal, is basically to match the pure MPI performance.
It is just not easy: on Haswell, or a multi-core system like it, it is hard to expect the hybrid code to beat the pure MPI code. But in this case I think this code does pretty well with OpenMP; there is a good, decent amount of speedup there.
Those are similar. Okay, maybe I will just mention this: these spikes are a really interesting one. We are not too sure what happened there, but these spikes happened only when running the code with one thread per task, and for the previous case we actually saw other spikes as well.
That one happens at 16 threads. Recently we looked into why we got those kinds of numbers, and basically it looks like some sort of interaction with the huge pages that were used. Anyway, only in these two cases did we see a negative performance impact from using huge pages; when we did these tests, our recommendation was actually to use huge pages all the time. I think this is partially some interaction with the system
configuration, I guess, and it really needs further investigation. But anyway, we first investigated the thread parallel scaling, and I will skip all of that. Then we also looked at the memory modes. Basically, the NERSC recommendation was to use either the quad,flat mode or the quad,cache mode, so our tests started from there, and the test results are very similar. This is what we get with VASP, with some spikes, and those spikes
A
Unfortunately,
they
are
reproducible
and
it's
not
like
a
random
thing
and
we
indeed
needed
to
investigate.
So
basically,
our
conclusion
is
all
these
are
test
cases.
Actually
they
fit
in
the
MACD
MCD
run.
Well,
so
it's
not
like
a
big
case,
so
we
see
that
as
long
as
it
the
case
can
fit
into
the
M
CD
ROM
memory.
we didn't see a performance difference between flat and cache mode, so our recommendation would be just to use the cache mode.
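In Slurm terms on Cori, the memory mode is picked through the job's node constraint; here is a hedged sketch, with the srun line and binary name illustrative:

```bash
#!/bin/bash
#SBATCH -C knl,quad,cache   # recommended: MCDRAM as a transparent cache
# For the quad,flat comparison runs the constraint would instead be
#   #SBATCH -C knl,quad,flat
# and in flat mode MCDRAM appears as NUMA node 1, requested with e.g.
#   srun ... numactl --preferred=1 ./vasp_hybrid

srun -n 8 -c 32 --cpu-bind=cores ./vasp_hybrid
```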
We also looked into hyper-threading, and basically our conclusion, I just want to keep this short, is that this is a very typical hyper-threading effect for DFT codes: when you use a small node count, hyper-threading helps a little bit, but in the later portion, the parallel scaling region, hyper-threading doesn't help and actually slows the code down.
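A hedged sketch of how one keeps a single thread per physical core on KNL (the counts are illustrative; each KNL core has four hardware threads, and srun's -c is counted in logical CPUs):

```bash
#!/bin/bash
export OMP_NUM_THREADS=4      # 4 OpenMP threads per rank
export OMP_PROC_BIND=spread
export OMP_PLACES=threads

# 16 ranks x 4 threads = 64 physical cores, one thread per core:
# -c 16 reserves 4 cores * 4 hardware threads per rank, and binding
# to cores keeps the extra hyper-threads idle.
srun -n 16 -c 16 --cpu-bind=cores ./vasp_hybrid
```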
So at NERSC we always recommend users to just not use hyper-threading. Huge pages are another thing we looked into, and basically in all the cases we observed in our tests, performance with huge pages is similar or better; we didn't see a slowdown, except for the one I mentioned at the beginning. I guess that is some sort of interaction with system issues, so we need to investigate.
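On a Cray system like Cori, huge pages are enabled through a craype module; this is a minimal sketch, assuming a 2 MB page size and an illustrative rebuild step, since the binary must be linked with the module loaded and the same module loaded again at run time:

```bash
#!/bin/bash
module load craype-hugepages2M   # load before linking the binary
make                             # hypothetical rebuild/relink step

# ... later, in the job script:
module load craype-hugepages2M
srun -n 16 -c 16 --cpu-bind=cores ./vasp_hybrid
```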
But anyway, that is our huge-page test. Another very important thing is that we looked at compilers and libraries. Like other DFT codes, VASP is heavily dependent on library performance, especially MKL, and in particular the linear algebra and FFT routines; they account for a major portion of the execution time.
So it is very important to look into this part. We compiled the code with different compilers, using different combinations of libraries and compilers, and we noticed that the best one is Intel plus MKL for the linear algebra, with the FFT routines from MKL through the FFTW interface that MKL provides. Basically, we can see the best one is somewhere here, the blue bar and the green bar over here.
So that is basically it: here the Cray result is the blue bar and the green bar is over here, so the best one is Intel plus the MKL library, and that combination was the best one. This is a similar one.
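For reference, a hedged sketch of what that build combination can look like on a Cray system; the exact makefile lines vary by VASP version, and the flags shown are illustrative of the Intel + MKL + MKL-FFTW-wrapper combination rather than taken from the talk:

```bash
#!/bin/bash
module swap PrgEnv-cray PrgEnv-intel   # use the Intel compiler (if not default)

# Typical link ingredients for Intel + MKL:
#   -qopenmp                    # the hybrid code needs OpenMP
#   -mkl=cluster                # MKL BLAS/LAPACK/ScaLAPACK
#   -I${MKLROOT}/include/fftw   # MKL's FFTW3 wrapper headers for the FFTs
ftn -qopenmp -mkl=cluster -I${MKLROOT}/include/fftw -o vasp_hybrid *.o
```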
So we did come up with some best-practice tips for our users. We are pretty confident that by around July, when Cori enters production and starts charging (it has actually already entered production, so we then start to charge the time on Cori KNL),
the code can run efficiently. So we came up with some good practices for users. Basically, how many threads to use is one important recommendation for them: four threads or eight threads is a good choice in all situations. Some of our users also compile the code by themselves; although we provide precompiled binaries, for various reasons they would like to have their own, so we recommend the Intel compiler and MKL if they want to do that.
In our case we have 68 physical cores on a node, but for VASP our recommendation is to use 64 cores. This is not just because it is easy to divide the threads and tasks; it is actually because the FFTs perform better with power-of-two core counts.
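Pulling those tips together, here is a hedged example job script for Cori KNL under those recommendations; the node count and the vasp_hybrid binary name are illustrative:

```bash
#!/bin/bash
#SBATCH -N 2
#SBATCH -C knl,quad,cache        # cache mode (flat showed no benefit for cases that fit)

module load craype-hugepages2M   # huge pages were neutral or helpful in our tests
export OMP_NUM_THREADS=4         # 4-8 threads per rank scaled best
export OMP_PROC_BIND=spread
export OMP_PLACES=threads

# 16 ranks/node x 4 threads = 64 of the 68 cores (a power of two, good for FFT)
srun -n 32 -c 16 --cpu-bind=cores ./vasp_hybrid
```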
So that is all I have for this case study.