From YouTube: The Future of High Performance Scientific Computing
Description
The Future of High Performance Scientific Computing, presented by Berkeley Lab Associate Laboratory Director for Computing Sciences Kathy Yelick at NUG 2013, the annual meeting of the NERSC Users Group.
We welcome everybody to the lab. This was an interesting request: to talk to the NUG not as the NERSC director. Sudip is going to tell you all about the future of NERSC, and I'm going to tell you about some other things, pulling in some material related to the work I've been doing in research, to try to tell you where I think the important problems are in computing as we look forward towards exascale platforms and things like that. I did want to just welcome all of you to the lab.

So what am I doing now that I'm no longer NERSC director? I'm the Associate Lab Director, and what that means is that I'm actually in charge of NERSC, ESnet, and the Computational Research Division. I think most of you know that, but I did want to put in one plug for, or discussion about, ESnet, because I don't think NERSC users are necessarily familiar with ESnet, especially those of you who are at remote institutions or are involved in science projects at other institutions.
It's important to understand that the network is also an instrument that you can use in science, and it is one where you might want to work with people at NERSC, who can also hook you up with people at ESnet, in order to figure out how to move large datasets around. ESnet has just recently gone through an upgrade: ESnet5 is in production, which is the first 100 gigabit per second transcontinental, or continental-scale, network. It has upgrade capacity in the form of dark fiber.

What this really means is that there are facilities you can use to send huge data sets around. One of the numbers that I don't have on here is what happens if you try to send these large data sets, terabyte data sets, around on another network. I think Eli Dart and Brian Tierney were recently doing a project where they were transferring data between Oregon and Berkeley Lab, and they ran into the effects of packet loss in the internals of the network.
Normal networks will drop packets once in a while, things get retried, and it works fine. But when they tried to do this with a large science flow in the middle, they saw the science bandwidth slow down by a factor of 80. So why is this important to you? If you have a large data set at NERSC that you want to put someplace else, or another data set someplace else that you want to bring to NERSC, you really want to make sure that the network path between the two facilities is well optimized.

There is a bandwidth reservation service in ESnet called OSCARS, and if you're sending large data sets around and you don't know about it, you should. There's also information about how to set up data transfer nodes, or the things we have at NERSC, this Science DMZ idea, to really build a high-speed connection on the other end. So there's information about how to do all of that. Now, a little bit about big data within DOE.
There's a picture, the artist's rendition of the new building, and some of the specs over there about how big it is and how expensive it is and things like that. It's a very efficient building, and it's going to be a very efficient data center and scientific computing center. It is probably the most energy-efficient one within DOE, certainly among the most efficient within the Office of Science, and that is because the temperature you experience outside is the temperature almost year-round here at Berkeley.

It's very rare to have very hot temperatures, which means you can use the ambient air temperature to actually cool the computers in the building. And we did a little analysis, a back-of-the-envelope calculation, in 2010; that was when we just had Franklin (I haven't updated this yet for the Hopper configuration), and we had about 200 to 250 publications per megawatt-year. So I challenge any other computing center to produce that many papers in a megawatt-year.
This is just dividing the number of publications by the number of megawatts we used that year, and there are a lot of ways we can look at efficiency; that's what's going to come up when I talk about the future and what we're worrying about in future machines. So, enough about the lab. By the way, I'm also very happy to answer any questions people have, because it's a small audience, so you might as well take advantage of that.

All right, so I wanted to talk a little bit about the future of computing, and start by looking back at the past. I was actually just talking to a computer scientist who works on human-computer interfaces. They work on really cool problems, like how to get rid of laptops entirely by just projecting things on your hand and using the motions of your fingers.
So you don't even have to carry a laptop around anymore. And he said, oh, but my computer's fast enough, I really don't need anything faster than that, which is a common perception that I think most computer scientists have. And so I like to do this little thought exercise where I say, well, what are two of the things that we really care about?

Things everybody in the world cares about, even if you're not a computational scientist who might understand more about the importance of high-performance computers. And those are: some kind of a smartphone, let's say an iPhone, and searching Google. In 2013 these are commonplace; we use them all the time, multiple times a day. And if you roll back to 1993, what do these devices look like? Well, certainly there are a lot of really creative user interfaces
in these things, and there are creative algorithms in both: in Google there's an asymmetric eigenvalue problem inside the page-ranking algorithm, and there's a bunch of speech recognition and so on in the iPhone. But if you roll back 20 years, you end up with the NERSC supercomputer of that era in your hand, right?

So you needed all those creative people, all those creative algorithms, all that new software, but you also needed faster, smaller, cheaper, denser computers. And so as we move forward, people say, oh, we don't need any faster computers. But you do: in order to get innovations like this, you really do need computers that are going to be much faster, much smaller, and much cheaper. And Google: what would you need to have Google? Well, first of all, you would need a few gigawatts of power.
So where do you get a few, say 30, gigawatts of power, which is about what you would have been using in 1993 if you had tried to build a Google data center, estimating what we think is inside a single Google data center? Well, Google of course is a green company; they like to advertise that they only use green power. So can anybody tell me where you can find, let's say, 20 to 30 gigawatts of cheap, green hydro power? It doesn't necessarily have to be hydro, but green power. Canada.

But if we had halted progress on computers in 1993 and just had progress in other things, then that's where we could have gotten that much hydropower. So now a thought exercise rolling 20 years forward, and this is always really dangerous because it's really hard to make these kinds of predictions. The first prediction is: there are no personal computers, and there are no departmental computers.
There are only client devices, which are perhaps embedded, and as I said, people are trying to get rid of keyboards and they're trying to get rid of screens, so we may not even really see computers. And then there's the cloud, right? And in the cloud we're including, well, we don't like to call NERSC a cloud, but it's there; it's a place where you can do scientific computing. And we don't travel very much, because we do a lot more telepresence.

Wouldn't that be nice? Lecturers teach millions of students. We're running one of these MOOC courses on campus at UC Berkeley; these are courses that have tens of thousands, or hundreds of thousands, of students in them. It's a really interesting teaching experience, from what I've been told; I haven't tried one yet. But one of the rules is: never hand out a homework assignment that has a mistake in it. That's the first rule of teaching a MOOC. Theorems might be proven online.
If you aren't familiar with this kind of thing, look at a webpage called Polymath. So there's sort of a more automatic version of that; maybe that one's a little bit more of a stretch. Users never log in to the NERSC system: this one I think is actually going to happen sooner, and it's already happening today. There are a lot of people who actually use NERSC who don't directly log into the systems. Probably most of you who are NERSC users in this room actually do log in, and you submit jobs and things like that.

So we've had a big debate about how we count all these people who use NERSC indirectly. Computers intuit what jobs should be run: okay, this one might sound kind of crazy too, but this is also sort of the idea behind some of the gateways. If you look at something like the Materials Genome Project, where you've got tens of thousands of simulations being run, it is not unreasonable for some algorithm to say, here's part of the design space
with enough structure on it that we think it should be filled in by simulations. Or the user asks queries, coming in from a web interface, about some particular material, and a bunch of jobs get run based on what that is. So the idea that you're not directly logging in and submitting batch jobs and so on is, I think, not such a crazy idea. No users actually visit the other user facilities either: it's already the case. Why do we have so few people in the room?

Because people don't really come to NERSC to use NERSC, right? Mostly, all of you just log in remotely. And what surprised me: I was at a meeting where somebody was talking about big data, and data in science and data in medicine and things like that, and this person was actually doing medical experiments. I figured, well, medical experiments, you have to be there, right? You have to be there with the subjects. And they said, well,
you send the material, or whatever it is, to the light source and have somebody there run it. And that changes the model of what the user facilities are. So we need to think about what this means for the kind of science that DOE does, for big team science and so on, and I'll leave that for you to think about.

OK, so my next kind of high-level discussion is about the world of high-performance computing and the politics of it, and this kind of big data versus exascale discussion that has been going on for a while now. Unfortunately, it's been cast as a "versus" in the discussions, but I think it's important to go back and think about where, within DOE, high-performance computing grew up, in terms of the growth of the HPC program within DOE. And it started, I think,
with the Comprehensive Nuclear-Test-Ban Treaty on the NNSA side, which really said that you have to use modeling and simulation because we're very restricted in the kinds of experiments we can do. So that shifted the balance between doing data analysis on the one hand and simulation on the other more towards simulation within the NNSA. And I think the DOE Office of Science, and the ASCR program, took advantage of that and said: yes, there are a lot of important science problems that can be done with simulation as well.

So the focus has been on simulation rather than on data analysis. Now, at the moment, because there is this huge growth in data rates coming from CCD technology, coming from sequencing technology and so on, we're seeing a shift towards data analysis: there are big data problems coming from things like the next-generation light source plans here, from the Belle II experiment, from the sequencers at JGI, and so on,
all producing huge data sets. And by the way, it doesn't really matter what the balance is up here, because both of these things rely on having faster computers, and that goes back to the little iPhone and Google exercise: you really need to have faster computers, or cheaper, more plentiful computation, in order to solve some of these problems.

OK, so let's see, I'll say a little bit about the science trends. I think these are some different examples than the ones Sudip will use, but these are examples from NERSC. I like to think of the science that we all do with computation as being divided up between large-scale science, that is, petascale up to exascale simulations; what I call volume science, which is about running massive numbers of simulations (some people call it capacity computing, but I think it's actually something a little bit more well-defined than that), meaning people who want to run ensembles
of runs that are very closely related to each other, where we need to have support for those kinds of ensemble simulations, whether you're doing uncertainty quantification or some kind of screening through biology data or materials data or whatever; and then the data analysis side of things, where you've got huge data sets. So things get oversimplified, and the one thing I want to make sure everybody in this room understands is that exascale is not only about the top category;

it's about the technology needed to solve any of these problems that require more computing performance. So, for example, climate models are of course very large-scale computations, although they also run a large number of them. This is just a slide about some of the history of climate modeling at NERSC; NERSC has been involved in the IPCC climate runs certainly since AR4. And going forward, why do we care about faster computing in climate modeling?
Well, one of the examples is cloud resolution: if you want to resolve clouds, you need to have a computer that is significantly faster. I think Gil isn't here, but Gil Compo, who's doing more of the data analysis side of climate change, also mentioned, when we were talking at the BER requirements workshop, that he needed about a hundred times more computing power in order to analyze, to reconstruct, datasets.

He's doing this 20th Century Reanalysis, which is reconstructing data from the very sparse datasets that exist, and he said that in order to get some of the effects, to get things like cyclones back into the reconstructed data, he needs faster computation, because what happens right now is that you're averaging over this very sparse and very noisy data set, and you average out some of these interesting local events. So here's the materials genome one, and I won't say a lot more about it, other than this.
The goal here is to decrease the amount of time that it takes to get from the design of a new material into manufacturing, to cut that in half. It's about eighteen months, I think... actually, the delay in that design-to-manufacturing time is eighteen years. (David: it is 18 years.) Yes, 18 years, sorry.

The idea is to search through a whole space of related materials and cut down to the interesting part of the space, so that when you go back into the lab and synthesize things you're not searching through the entire space. And so this gets into the case where you really want a sophisticated interface for driving the simulations; you don't want to submit each one of these jobs one at a time. And then there's the genomics area.
You do see this kind of growth in computing performance, both from the Linpack benchmark and also from the Gordon Bell prizes, which are, by the way, of course also very highly optimized codes. Now I want to make one side comment about the cost of running NERSC and the cost of cloud computing, because the cloud providers like Amazon, Yahoo, and Google have done an incredibly good job of making you think that cloud computing is free: it's only 10 cents per core-hour.

So they do overestimate the cloud costs, but they also underestimate the cloud costs in many different ways. As I said, that number doesn't measure the slowdown, and it doesn't take into account that you don't get any consulting in the cloud, or scientific computing experts; there's no real account management, there's no software support, and all of those things are about a third of NERSC's budget. And furthermore, why is this true?
Why is it that Google can't provide computing more efficiently than NERSC can, given that they have a larger scale of computing than NERSC, and the idea of economies of scale? The answer is that they probably can: they can actually buy computing infrastructure at slightly less than NERSC, although NERSC is pretty far up the efficiency curve; we're already buying very large-scale systems and very large quantities of power.

The power here on the hill is actually pretty green and also very inexpensive relative to what you would pay in, say, a traditional commercial setting. So NERSC has many of the benefits of cloud computing at scale, but we run at much higher utilization, over 90 percent, whereas most of the cloud facilities are struggling to get over about sixty percent utilization; many of them run much lower than that. And the cost per core-hour:
from when I started at the end of 2007 (technically January of 2008, but Franklin was installed in October of 2007) until we installed Hopper, and then last year, when we had both Hopper and Franklin running for a while, the number of core-hours went up by a factor of 10 in that four-year period. In that same period of time, the cost of buying a core-hour at Google, sorry, at Amazon in their EC2 cloud, dropped by 15%.

And what this says is that the main problem we have in getting to exascale is not about performance; it's about power, and how to make it possible to actually build a machine that you can afford to turn on, because if you just look at Moore's Law scaling you'd have about 200 million dollars in power costs, just to pay the power bill. So I'm now going to switch and talk about
what I think all of you who are writing codes and worrying about the next generation of architectures, and what these systems will look like, should be thinking about in terms of the future of these codes, and what the problems are. The first problem is that communication is very expensive. It's expensive in time: that's the little table up there in the upper right, showing the annual improvements in floating-point operations per second (59 percent), in bandwidth, and in latency.

Now you say: but flops stopped getting faster in 2004, you just told me that. But this is the throughput rate of a single chip, which has continued to go up by roughly 59 percent a year. It has slowed down a little bit, and we are going to have a problem in the next 10 years or so, when we start running out of transistor scaling as well.
The bottom graph then looks at the amount of energy that's used to do different operations within the computer, in picojoules. If you're doing arithmetic, you're there today at around 100 picojoules, projecting forward to more like 20 picojoules, and accessing something in a register is significantly less energy. But as soon as you go off chip, even to local DRAM memory, you're up to one to two orders of magnitude more in terms of the energy consumption.
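To make the scale of that gap concrete, here is a back-of-the-envelope sketch in Python. The per-operation energy constants are illustrative placeholders in the spirit of the numbers above (roughly 100 picojoules per floating-point operation, and far more per byte that has to come from off-chip DRAM); they are assumptions for the example, not values read off the slide.

```python
# Rough energy split for a kernel; the constants are illustrative
# placeholders, not measurements from the slide.
PJ_PER_FLOP = 100.0        # ~100 pJ per double-precision operation today
PJ_PER_DRAM_BYTE = 1000.0  # assumed cost per byte fetched from off-chip DRAM

def kernel_energy_joules(flops, dram_bytes):
    """Return (arithmetic energy, data-movement energy) in joules."""
    arithmetic = flops * PJ_PER_FLOP * 1e-12
    movement = dram_bytes * PJ_PER_DRAM_BYTE * 1e-12
    return arithmetic, movement

# Example: a streaming kernel doing one flop per 8-byte word read from DRAM.
arith, move = kernel_energy_joules(flops=1e9, dram_bytes=8e9)
print(f"arithmetic: {arith:.2f} J   data movement: {move:.2f} J")
```

Even with these rough numbers, the data-movement term dominates for any low-intensity kernel, which is the point of the graph.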
So, given that, the problem of exascale is really about saving energy, and we need to minimize the amount of data movement. We also have to be careful to separate bandwidth problems, which are about the number of words being moved, from latency problems, which are about the number of separate messages being sent, and these things are hard to change. The latency problems are really about physics: you can't get any better than the speed of light across the machine room.

Bandwidth is about money, which Sudip now understands very well, and I'm sure he did before as well. When you go and talk to the vendors in a negotiation and say, well, we want twice as much bisection bandwidth, they say, okay, that will cost you substantially more money, and if there's only a small part of the workload that can benefit from it, it may not make sense.
Besides that, there's a point of diminishing returns: once you've spent 90% of your budget on memory bandwidth and network bandwidth, there isn't much left to take away from computing in order to put into the bandwidth of the machine. So the strategies are slightly different for these different cost components of communication. When it comes to latency, you can try to overlap it:

you can hide it by doing other things on the computer. That doesn't make the latency go away, but it does make it less painful and less expensive at the algorithmic level. Whereas with bandwidth, the problem is usually more fundamental in the algorithms; the only thing you can do is come up with new algorithms that don't send so much data.
The gap between bandwidth and computational capability on a single chip has continued to grow, but the way to think about this is not as a wall but as a swamp: we've been walking into that swamp for years, and we're going to continue walking into it, because the number of floating-point units, the amount of arithmetic performance you can put on a single processor chip, is going to continue to grow much faster than bandwidth will grow. And there are technologies that we are looking at,

that DOE is looking at, in terms of optics, on-chip silicon photonics, in the longer term, and in the short term memory technologies such as stacking, that will hopefully make this bandwidth gap a little bit better. But fundamentally this is still going to be a problem. So, this slide is maybe starting to get old
now that the election is long over, but Obama actually understands this problem. The President's FY12 budget said that one of the things DOE needed to do was to minimize the communication between processors and the memory hierarchy by reformulating the communication patterns specified within the algorithm. Now, you have to be a little bit careful about taking lessons that you learned in your scientific work and applying them at home, or employing them in another setting,

like in the debate in Denver: I think that Obama might have taken communication avoidance a little bit too seriously. So, a few lessons now for all of you who are writing scientific software, designing algorithms, or supervising people who are. The first one is to really understand the communication limits, and for this I'd like to use Sam Williams' roofline model. How many people are familiar with the roofline model? Okay, yeah: all the co-authors of Sam's papers, and the local people. So:
this is a nice way to think about the fundamental limit of bandwidth in your systems, and you can apply it at various levels. I'll talk here about what it looks like between the DRAM memory of a single processor and the processing chip, although you can apply this to other parts of the memory hierarchy. It's a very simple model, and it's actually what I think people were using intuitively when they were trying to optimize codes to minimize bandwidth, but it captures it in a nice picture.

So what is this picture? First of all, it's important to realize it's a log-log scale. What is on the x-axis is a property of the algorithm, the computational intensity: that is, the number of floating-point operations per byte moved from the memory into the processor chip, so the amount of computation you do relative to the data you move. The y-axis is the attainable gigaflop rate that you can get for that code. Now, why is it called the roofline?
Well, the flat part of the roof is the peak floating-point performance of the hardware. The other lines on here are all hardware characteristics, so they're fixed for the hardware; the basic plot is fixed for a particular processor. You start with the top line, which is double-precision floating-point peak performance. That's the number the vendors always tell you: this is how fast my processor goes. But then, if you don't actually use fused multiply-add instructions, on a lot of processors you drop down by a factor of two (remember, it's a log scale).

If you don't use SIMD operations, you might drop by another factor of two, and if you don't use instruction-level parallelism, that is, careful scheduling of your instructions, you'll drop down by another factor of 2; that's the "without ILP" line. So this gives you a sense of how fast you should be going in terms of the floating-point performance of the processor.
Now, the diagonal line is maybe a little bit harder to get an intuition about, but it is just the bandwidth between the memory and the processor. You start with the peak bandwidth, the guaranteed-not-to-exceed number. But if you're not using software prefetch, on a lot of memory systems you won't actually get that peak performance; you might drop by another factor of 2 or so. And if you're not respecting the NUMA structure of the architecture, like on Hopper,

where many of you who have done careful optimization of the node code know that the NUMA structure is very important, you might drop by another factor of 2. So if you work out the bytes per second, the bandwidth numbers, and put them on the graph, you end up with these diagonal lines, and that also limits your performance.
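As a concrete way to read the plot, here is a small Python sketch of the roofline bound: attainable performance is the minimum of the in-core ceiling and computational intensity times memory bandwidth. The peak, bandwidth, and kernel intensities below are hypothetical placeholders, not the values on the slide.

```python
# Minimal roofline sketch. All numbers are hypothetical placeholders.
PEAK_GFLOPS = 170.0   # double-precision peak of one imagined node
PEAK_GBS = 50.0       # imagined DRAM bandwidth in GB/s

def attainable_gflops(intensity, fma=True, simd=True, ilp=True):
    """Roofline bound for a kernel with `intensity` flops per DRAM byte."""
    ceiling = PEAK_GFLOPS
    ceiling /= 1 if fma else 2    # no fused multiply-add: lose about 2x
    ceiling /= 1 if simd else 2   # no SIMD operations: another 2x
    ceiling /= 1 if ilp else 2    # poor instruction-level parallelism: another 2x
    return min(ceiling, intensity * PEAK_GBS)

# Rough intensities: SpMV ~0.25 flops/byte, stencils ~0.5, blocked DGEMM tens.
for name, ai in [("SpMV", 0.25), ("stencil", 0.5), ("blocked DGEMM", 16.0)]:
    print(f"{name:14s} bound: {attainable_gflops(ai):7.1f} GF/s")
```

Kernels in the sloped region are bandwidth-bound; kernels that reach the flat region are compute-bound.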
So your goal in optimizing code, of course, is to try to move your code over into higher computational intensity, which all of you knew before I told you about the roofline model; but this gives you a more concrete picture of what the limits of the system are. I think there are some of Stephane's results here, the GTC work, along with work by other people at the lab; a whole bunch of people have worked on each one of these points.

This is optimizing a number of different computational kernels for two different architectures, an Intel Nehalem and the NVIDIA Fermi system, and what you can see is that the performance of these is all over the place, but that roughly they do match the roofline. That is, if you work really hard at optimizing these codes, you can get them to the point where they are truly bandwidth-limited.
I didn't bring this particular graph, but it is also the case that if you take an average NERSC application and run it on the system, it often is not pegging the memory bandwidth. So it's important to realize that yes, some of these problems, as you would expect, are memory-bandwidth-limited: a sparse matrix-vector multiply is indeed bandwidth-limited, and stencil operations typically are bandwidth-limited. But many times the actual code that people are running is not, so there's often a fair amount of headroom in there, and understanding this roofline model can help you.

Number two is to understand that, in order to get better bandwidth utilization, you need to do higher-level optimizations. That previous slide was all about optimizing.
Now, if you're sitting here at the roofline, what can you do? This is the position we were in a few years ago in the BeBOP project. We were looking at these sparse matrix-vector multiply kernels and we said, well, what do we do? We can't beat the bandwidth of the machine. The answer is that you look at a higher-level kernel and see if you can avoid bandwidth by optimizing at that higher level. The example of that is something like sparse matrix-vector multiply.

Let me just go to my other picture here for a minute. In a sparse matrix-vector multiply, you need to read the matrix and then do the multiply, multiplying by each one of those entries in the matrix. And what was really discouraging when we looked at the performance of this (which is another way of saying that it was sitting on the roofline) is that basically the amount of time to do
a sparse matrix-vector multiply is limited by the time to read the matrix. You can't do anything about that; you've got to read the matrix, right? So the question is: can you read the matrix once and take multiple iterative steps, because you are in an iterative solver reading that matrix over and over again? The idea is that we'll pick up a little piece of the matrix. Of course, a sparse matrix is really just an unstructured graph, so we'll pick up a little piece of our unstructured graph; we're doing a nearest-neighbor computation,

an SpMV operation, on it. In order to do the update on that vector, whose entries are the nodes in that graph, we need to get a slightly larger region: we need the neighboring points so we can compute the next value of the interior of the graph. And if we want to do two steps with one read of the matrix, then we need to get a slightly bigger piece of it, which means all of those edges as well, and three steps, and so on.
So we actually did this, and you can make sparse matrix-vector multiply go much faster if you do k steps at a time; that is, you compute A to the k times x rather than just A times x, but you now have a higher-level computation. So this formulation has that A-to-the-k piece in there; you can see the w equals that vector there. We stick in our A-to-the-k kernel; there's some other stuff going on with the reductions that I won't talk about right now, which also has to do with communication. But to a compiler person, this looks like kind of a loop-interchange idea.
We have the k loop on the outermost level there, and we're going to take that k loop and stick some of it on the inside, so that once we read the matrix, we do k steps. Except that this is a completely illegal compiler transformation: we've completely changed the dependencies in the program, and you no longer get the right answer. Maybe it still sort of smells like GMRES, but unfortunately it doesn't behave like GMRES. So this is how GMRES behaves in terms of its residual error.

It's an iterative solver, so you want the error to go away as you go through the iteration counts of the solver; this is not performance, this is error we're measuring here. And this is what happens when you take the new communication-avoiding algorithm that uses the A-to-the-k kernel, that is, does one read of the matrix for every k steps: it runs faster, but it no longer converges. So it is not
a very useful algorithm. However, it turns out that if you use a different basis, called a Newton basis, you can get convergence back again, and you are still using something that is kind of like that A-to-the-k kernel inside. There's lots of hand-waving underneath this, but the high-level point is that you shouldn't just be optimizing the innermost loops of your code; you need to think about whether you could rearrange something at a much higher level

that would allow you to do less communication, less data movement. And by the way, you can put this all back together again, and it actually does run faster to use the communication-avoiding part: those are the orange and red bars, which are all set to one there, so it's normalized to the faster version, and the other bars show the slowdown of the original version.
We're now actually trying to generalize this to arbitrary loop nests. But the basic idea with which I think a lot of people go into a scientific computation is: you've got a physical domain, we're going to chop up the physical domain, we'll give each processor a piece of that physical domain, and it will be responsible for the updates on it. That makes, by the way, the concurrency-control problems really easy; you don't really have to worry about two processors updating the same value, so it's a nice way to organize your code.

So this is some performance analysis done with a new algorithm that doesn't just do domain decomposition; it actually makes multiple copies of things, in matrix multiply. This is Edgar Solomonik, a grad student on campus working with Jim Demmel, his advisor, doing matrix multiply, and this is running on Blue Gene/P. This is running time: this is the old algorithm, and this is the new algorithm.
The new algorithm, for reasons I'll explain in a minute, is called a 2.5D algorithm. It's not just using this idea of chopping up the result matrix, the C matrix if you're computing C equals A times B, into separate pieces; it's actually doing something more complicated. There's a different problem size shown with the speed-up. All right, so I wasn't involved in this, but I was watching this work, and I asked myself: what was I surprised about?

First of all, I was surprised that anybody could make matrix multiply go any faster from an algorithmic standpoint. I know there are people working on making the exponent a little bit lower, in terms of Strassen-like algorithms and things like that, but this is basic, order n-cubed matrix multiply; it wasn't really changing the computation in any significant way.
It was just changing the data movement, and that makes it go faster. The basic idea was to make copies of the C matrix, have different subsets of processors updating those copies independently, and then combine the results together at the end. So the lesson, and there's a nice theory behind this (it's provably optimal), the lesson was: never waste fast memory.
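Here is a serial NumPy stand-in for that replicate-and-reduce structure; it is a sketch of the idea, not the actual 2.5D code. Each of c replica groups forms its own partial C from a slice of the summation dimension, and the partials are combined at the end. In the parallel algorithm each copy lives on a different subset of processors and the final combine is a reduction across groups.

```python
import numpy as np

def matmul_replicated(A, B, c=2):
    """Replicate-and-reduce sketch: c groups each compute a partial C from
    their own slice of the summation (k) dimension, and the partial results
    are summed at the end."""
    k_slices = np.array_split(np.arange(A.shape[1]), c)
    partials = [A[:, s] @ B[s, :] for s in k_slices]  # independent partial C's
    return sum(partials)                               # combine at the end

A = np.random.rand(96, 64)
B = np.random.rand(64, 80)
assert np.allclose(matmul_replicated(A, B, c=4), A @ B)
```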
Now, you may be concerned that Edison, or NERSC-8, or the future systems are not going to have enough memory per core, which is always going to be a concern. But if at some point in the middle of the computation you have a phase that is not using all of the fast memory on those systems,

you want to consider doing something that decomposes into finer-grained parallelism and then makes use of all of that memory in order to get speed-up. And you're doing it not to get more parallelism; you're doing it to reduce communication. So now the question is: can we take this beyond just matrix multiply? Can we do something for everybody else who is running other computations in the world? And so this is looking at what matrix multiply actually looks like in an iteration space, which is the way compiler writers think about it.
There are three loops, right: i, j, and k, and there's our iteration space. You can then think about where the matrices A, B, and C fit, because they are projections of that iteration space onto its surfaces: the C matrix is the top and the bottom, the A matrix is the front and the back, and the B matrix is the two sides. At every point in the middle of that iteration space you do a multiply and an add, which updates a value from each of the three

colored faces I've shown here, and so you need to pick up those elements. So the question is: how do I divide up that iteration space in order to minimize the amount of surface area that gets touched by projecting out that interior region? You can imagine the way the proof goes: pick up an arbitrary glob of stuff in the middle of this cube, figure out what its projection is, and ask what shape has the smallest projection; not surprisingly, the smallest projection comes from a cube.
This one just chops things in two dimensions, and this one actually chops in the third dimension as well, so it could be called a 3D algorithm; but for technical reasons that name is reserved for the extreme case where the third dimension is as big as possible, which is called the 3D algorithm. So this is called the 2.5D algorithm, because it's somewhere in between. OK, so you may not care about matrix multiply; you may care about other things. So the question is: can we apply this to other things?

Actually, some of my students (I do have students again, new students who are working on some of these ideas) have figured out that you can apply this to N-body codes. Just to give you a hint of what the idea is, we'll do a really simple N-body code here, for purposes of illustration and because it's a lot easier to analyze: you've got order-n particles and you've got P processors,
so it's order-n words. It turns out you can use the same replication idea: we replicate all the particles a few times, and then, within each smaller group of processors, we send all the particles around so that everybody can do a subset of the updates. So, for example, the first row is responsible for all the pink updates, the second row for the green updates, the third row for the yellow updates, and so on, and you can actually prove that you get better performance out of it.
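The same pattern can be sketched for a toy N-body sum (again a serial stand-in, with a made-up one-dimensional interaction in place of a real force law): each group holds a replicated copy of the particles, accumulates only the interactions from its own slice of the sources, and the partial results are reduced at the end.

```python
import numpy as np

def total_forces_replicated(pos, groups=4):
    """Each group sees a full (replicated) copy of the particles but sums only
    interactions from its own slice of the sources; the partial force arrays
    are then combined, mirroring replicate-and-reduce."""
    partials = []
    for src in np.array_split(np.arange(len(pos)), groups):
        diff = pos[:, None] - pos[None, src]        # all targets vs this source slice
        partials.append(np.sign(diff).sum(axis=1))  # toy 1-D "force law"
    return sum(partials)                            # reduction across groups

pos = np.random.rand(512)
forces = total_forces_replicated(pos)
```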
But, you know, I like this quote: in theory, there is no difference between theory and practice, but in practice there is. So is this just a theoretical result? Well, the answer is no; you can actually get speed-up numbers from this as well, just as you could with matrix multiply. So it's important to think about how you might parallelize your codes in ways that will reduce the amount of traffic, by looking at higher-level kernels and, in this case, by thinking about other approaches than just decomposing

the data structures into independent pieces. The other way to think about what's going on, in both the matrix multiply case and the N-body case, is as a kind of replicate-and-reduce: you make replicas of your data structures, you independently work on partial results, and then you reduce at the end to get the full answer. OK, so have we seen this before?
Yes. In fact, when I've talked about these algorithms to some audiences, they say: but we used pretty much that algorithm for matrix multiply on the Connection Machine, the CM-2. For those of you in this room who are old enough to remember the CM-2 and the MasPar machine, those were machines with little teeny tiny processors, and people did indeed use these kinds of algorithmic ideas, because they needed so much parallelism.

The basic idea is: if you need to have a bunch of processors updating something simultaneously, rather than worrying about locking, make a copy of the thing you're updating, have everybody update independently, and then combine the results together at the end. And it gets used in SIMD extensions and GPUs and so on.
OK, so you all know about making messages large; any good MPI programmer knows that you want to send a small number of messages, because each message is very expensive. But the flip side of that is that you also want to overlap and pipeline your communication. This sometimes runs contrary, because in order to overlap and pipeline (pipelining means overlapping communication with communication, whereas overlap just means overlapping it with computation),

you want to start the communication as soon as possible, which often means you're not yet ready to do all of the communication at once. So you start what you can, and you end up sending more messages in the end. This is what the PGAS ideas are all about: it's about really making it easy to do overlap, and it's really about DMA operations, that is, doing fine-grained, very lightweight communication across a global address space. So this
is what the PGAS languages, like UPC, Co-Array Fortran, Chapel, and so on, look like: every processor has a chunk of the memory, which is physically what you have in the system, but it can access data anywhere in the system simply by doing a read or a write; it doesn't have to ask the other processor to help it do the communication. So I also think of this as never having to say "receive."

It turns out that this is closer to what the hardware actually does, because down inside an MPI send and receive there's typically a DMA operation going on. And why do these kinds of programming models come up? These global address space models especially come up when you have a very irregular sort of data set. So imagine that you want to compute a histogram on a huge data set that does not fit in the memory of a single processor, or even on the biggest shared-memory multiprocessor
you can find. What you do is you take your machine, like Hopper, and you spread the histogram over all the processors. Now, as you're computing this histogram, you've got these keys coming in and you need to put them in a bucket in the histogram, and those are going to be essentially random accesses into the middle of the machine's memory.

That's what these global address space programming models are about: making these kinds of things easier to express and actually faster to execute, in general. Whereas with MPI, if you're really working on a physical simulation problem, it is often easier to divide up your domain physically; even if you're using the replication idea, you've got much more structure to work with, and so you use that.
That's why the MPI codes have actually worked out in practice. But it's very painful to program a histogram in MPI, because you don't know when to say "receive," right? If you're the processor that owns the bucket that some other processor is inserting a key into, it's not a very natural thing to figure out how to say "receive" in that kind of a model.
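As a sketch of the contrast, this is roughly what the one-sided style looks like using MPI's own one-sided (RMA) interface through mpi4py, standing in for the UPC-style global address space: the rank that owns a bucket never posts a receive. This is a hedged illustration; the bin layout, sizes, and names are made up for the example, and it assumes mpi4py is installed.

```python
# One-sided ("never say receive") histogram sketch, assuming mpi4py.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
BINS_PER_RANK = 1024

# Each rank exposes its slice of the global histogram through an RMA window.
local_bins = np.zeros(BINS_PER_RANK, dtype=np.int64)
win = MPI.Win.Create(local_bins, disp_unit=local_bins.itemsize, comm=comm)

keys = np.random.randint(0, BINS_PER_RANK * comm.size, size=100_000)
one = np.ones(1, dtype=np.int64)

win.Fence()
for k in keys:
    owner, offset = divmod(int(k), BINS_PER_RANK)
    # Remote update into the owner's window; the owner never posts a receive.
    win.Accumulate(one, owner, target=offset, op=MPI.SUM)
win.Fence()
win.Free()
```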
So I think I'll skip some of this, except to say that this is some work done on the MILC application in QCD. This is Hongzhang Shan and a bunch of others, who looked at the comparison between a UPC implementation and an MPI implementation, going out to 32,000 cores. Now, this is looking at a slightly different version of the algorithm, and whenever you compare across programming models you have this problem that you write something in a different way, sometimes in a different language; and in this case that's indeed what happens:

you get a different version of the algorithm. But you do get, as you can see, much better scaling of the performance of QCD. OK, so I think I will wrap up and just say that there are a lot of challenges we're facing in the next generation of scientific computing. Scaling is the most obvious, but exascale is really not about scaling.
Exascale is about figuring out how to use energy well: how to design and use and program more energy-efficient processors. It is also about synchronization and the dynamic system behavior that we're going to see; many of you who run very large-scale simulations on Hopper already see this effect.

But what's really important is still location, location, location: all of the things in your code that do communication, whether it's communicating up and down between the processor and the memory or communicating between the processors, continue to be a really important part of how you optimize code. So, in conclusion: communication hurts, so be careful and try to minimize the amount of communication you do. Thanks.