From YouTube: 07 Putting it all together in real codes
Okay, so hopefully at this stage we all have a clear understanding of the systematic and predictable way that Codee proposes to optimize the performance of our software, from very simple codes like PI, to something more complicated like MATMUL, even to detecting defects. And we know that when a given piece of our code fulfills the properties of a typical kernel that we want to optimize, then we can use everything we have seen so far.
But in order for real applications to get to this stage, where we can unlock these capabilities, we essentially need to address the complexity of real software. So here, in this "putting it all together" session, what we want is to enumerate, highlight and raise awareness around some of the topics that make real applications more complicated from the performance optimization perspective.
For that purpose we are proposing LULESHmk. This is a simplification of the LULESH CORAL benchmark that we already used in past editions of training sessions at NERSC, and now we have essentially reviewed all that material and how to approach the optimization of LULESHmk using this new predictable pathway that we have in Codee today.
Essentially, in LULESHmk we have selected functions and pieces of code that are exactly the same pieces of code you can find in the real LULESH CORAL benchmark, while others have been simplified for the sake of teaching: to enable focusing on specific challenges from the programming point of view, and to reduce a bit the complexity of real applications, while still being able to illustrate some of these challenges.
So here you can see in this table that in this course we have addressed how to find opportunities for offloading, which is not easy; many times we need the support of tooling to do that better and faster. We have addressed how to optimize the data layout for data transfers, and also how the data transfers, and the potential defects we can introduce in our code, are related to the way we represent the data in our codes: the logical interpretation of a matrix is different from its real implementation in our codes. But real codes have more than that.
Real applications typically have multi-dimensional arrays, which lead to deeply nested loops. Can we exploit more of the computational power of GPUs by parallelizing different nested loops all together, at the same time? This is what we typically address when we talk about the challenge of exploiting massive parallelism through loop nest collapsing. We will try to address that in the following course.
Another thing is: we have seen how to surround a given kernel with data transfers, but what happens if this kernel is not executed once, but inside a simulation loop, repeated many times? We need to understand the wider picture of the application to see how to minimize data transfers. Maybe some input data is read at the beginning of the program and only needs to be transferred once, not once per iteration of the simulation loop.
This is what we call here minimizing the transfers through the convergence loop. And sometimes we don't have one simple loop; we have several loop nests chained one after another. Data transfers should also be minimized there, by grouping several loop nests within one single data transfer. This is minimizing data transfers across consecutive loop nests. And for this session we have selected the last challenge: in real applications we typically don't have a loop nest that is self-contained.
Inside these loops we typically have function calls, which in turn can make more function calls, so we can have nested function calls in our applications; this is something that arises naturally in real applications. So this is the challenge that we have identified and called here "identify auxiliary functions to be offloaded". Okay, so.
Before showing you the additional materials that we will not be covering in this course, let me just point out that all these materials are available. Remember again: we are responsible for the correct usage of the APIs, OpenMP and OpenACC, and of the programming language. Of course we have the support of the compiler, and we have the support of additional tools like Codee, but in the end we developers must guarantee correctness and check that the performance increases. So in real applications, beyond what we said here, we have just enumerated some of the challenges.
Some challenges have to do with the platform, or with the complexity of real applications. You need to deal with not only C, not only Fortran, but a mixture of C, C++ and Fortran for some programs, with routines that need to interact with each other. You typically need to use different compilers: you want your application running on Summit, running on Perlmutter, or running on an AMD system like Crusher; different compilers, different runtimes, different programming environments.
We want to handle this complexity as well with build systems. Simple codes can be compiled with a single invocation of the compiler, but real applications composed of thousands of directories and source code files need build systems. We need a more portable and maintainable way of building our software by components, and this is where Makefiles or CMake tools come into play, so we need to find a way to interact with these build systems as well. And, of course, operating systems: it is different to deploy and develop on Linux.
We need some kind of methodology to benchmark correctly: take several measurements, discard the extreme, non-relevant values, take the average of the rest, and always guarantee correctness. So, in the end, all of these are topics that we typically don't cover in this kind of course, but in real applications you need to consider the constraints and the requirements of your project.
So, in the end, you can be optimizing performance, but with trade-offs: performance versus maintainability, performance versus portability across platforms, performance versus something else. Developing real applications with high quality is not easy, and sometimes we need to make trade-offs between one objective, like performance, and another one, like portability or readability, that can be contradictory, fighting one with the other. So this is the complexity of real applications and real software.
For interaction with compilers, you can go and read the performance optimization report of Codee and compare what it can add with respect to the existing compiler. You can be compiling your code with the NVIDIA compiler on Perlmutter, and maybe you want to run it with the IBM compiler on Summit or with the AMD compiler on Crusher, so you need integration with compilers. Also, use case number five: integration with build systems like CMake and Make. And finally, benchmarking the performance on the platform.
So for these use cases, and others that are not mentioned here, like detailed information about memory access patterns, about data scoping, about the memory footprint of your application, we have more advanced capabilities in Codee that we have not explored in this course and that are not listed here. But our purpose and our commitment is, for each of these use cases, to provide you with a video that you can watch, typically around three minutes long, where you can see how to use the Codee capabilities for that use case.
Essentially, I want to highlight the typical use cases and the videos that you have. If you have any request, any typical use case that you don't see covered here, please reach out to NERSC, or to us directly, so we can guide you on how to use it, and eventually even produce a new video recording that can help you and others in the community. Okay, please feel free to do it at any time; we will proactively be addressing those suggestions.