From YouTube: DESI
Description
Laurie Stephey and Daniel Margala of NERSC present a talk on DESI. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/ Session Chair: Yan Zhang.
Laurie Stephey: Hey everybody. So, as Yan said, I'm Laurie Stephey. I've been working with the DESI team for about three years, and I'd first like to acknowledge that this has been a team effort: I'll give the first half of this talk and Daniel will give the second, but Rollin Thomas and Stephen Bailey also made significant contributions. What makes this talk different from the rest of the agenda here at GPUs for Science is that DESI is a Python code.
Okay, so DESI, for anyone who's not familiar, is the Dark Energy Spectroscopic Instrument. Its mission is to better understand dark energy, and the best way to do that is by making a 3D map of the universe. DESI will be doing that over the course of a five-year survey, which officially starts next year, although they started taking data in late 2019. The instrument is located at a telescope on Kitt Peak in Arizona, and it will be scanning the sky, taking data, and sending those data to NERSC every night for five years.
As you might imagine, that's a lot of compute time, so it's important to make it as efficient as possible, not just for DESI but for everybody else running at NERSC who also wants to get their jobs through the queue. The initial charge from NESAP was to try to speed their code up on the Knights Landing partition of Cori, but also to keep the code in Python: the DESI developers are astronomers, not necessarily computer scientists, and they really like Python.
So what are your options if you're a Python programmer? Things are changing pretty quickly, so what I'm telling you today will probably be out of date soon, and the landscape was different a year ago when we were starting this effort. At that time, CuPy was kind of the best-documented, most fleshed-out option, so that's what we chose.
A
What
is
coupon,
it's,
basically
a
drop
in
replacement
for
numpy,
so
it
looks
very
much
like
numhai,
but
instead
of
the
back
end
being
c
or
c,
plus
it's
cuda,
so
for
you,
the
python
programmer.
You
mostly
don't
have
to
worry
about
that.
You
just
continue.
Writing
your
code
with
some
caveats,
but
that
that's
what
we
chose
but
in
case
you're
wondering
there's,
there's
quite
a
few
additional
options
out
there.
I don't know if you can see my mouse or not, but on the left is CuPy; in the middle is JAX, a framework out of Google that uses the XLA compiler to generate code, so it's a little more portable; there's also Numba, which I'll talk about in a minute; and if you're a scikit-learn or pandas user, you might consider NVIDIA RAPIDS. So there's really no one answer.
Okay, so if you're curious what CuPy looks like: I told you it was easy, but don't take my word for it. It looks very much like using normal NumPy; you could even import cupy as np if you wanted to. You, as the programmer, just need to make sure that you move your arrays to and from the GPU and keep track of where they are, but otherwise it's pretty straightforward to use.
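As a minimal sketch of the pattern Laurie describes (the array names and the NumPy fallback are my own illustration, not DESI code), the same array expressions can target either backend; `asarray` moves data to the GPU and `asnumpy` copies it back:

```python
import numpy as np

try:
    import cupy as xp   # drop-in replacement for NumPy, CUDA under the hood
    on_gpu = True
except ImportError:
    xp = np             # fall back to NumPy where CuPy/CUDA is unavailable
    on_gpu = False

a_host = np.arange(10.0)       # data starts on the host (CPU)
a_dev = xp.asarray(a_host)     # move to the GPU (a no-op under NumPy)
b_dev = xp.sqrt(a_dev) * 2.0   # reads exactly like ordinary NumPy code
b_host = xp.asnumpy(b_dev) if on_gpu else b_dev   # copy the result back
```

The caveat Laurie mentions is visible here: the programmer, not the library, is responsible for tracking which arrays live on the device.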
The question is: what if you need something that CuPy has not implemented? They've implemented a lot of the NumPy interface, but some more custom, more niche things are not there, so the answer is you need to do it yourself. Okay, fine; there are a lot of different ways you could do that, but the easiest and most Python-like one is a tool called Numba.
A
I
don't
know
if
anybody's
used
number
for
cpus,
it's
a
very
friendly
way
to
jit,
compile
and
really
speed
up
cpu
code
and
it's
similar
for
gpu,
except
that
it
you
really.
You
have
to
think
about
writing
code
for
gpu,
which
is
different
than
for
cpu.
So
what
does
that
mean?
Here's
just
a
very
basic
example
of
what
it
looks
like
to
use
number,
and
this
is
what
we
have
done
for
desi,
so
in
situations
where
they
needed
functions
that
were
not
in
kupai,
we
have
written
kernels
using
number.
You add this decorator, numba.cuda.jit, which tells your code, "hey, this is GPU code." There is this cuda.grid function, which is basically how you communicate with the GPU threads, and one thing you need to double-check is that you're not exceeding your thread block size; that's what the comparison between 0 and 32 is doing. You can't return anything, you can't allocate any memory; you basically can't do most of the things you would like to do in NumPy, so it really starts to look more and more like CUDA and less like Python.
It's still easier and more friendly than some of the other frameworks, but yeah, this is the option we chose for DESI. Okay, so with CuPy and Numba we were able to get their code onto the GPU, but as a main theme of GPUs for Science Day has been, it's not enough just to get it onto the GPU: you want it to run well.
Instead of cutting your task up into lots of small pieces, which is maybe good for a CPU, you want to do the opposite for the GPU, to take advantage of its massively parallel nature.
We kind of saw this coming, and we started a major code refactor in 2019. This was not trivial; we had to rethink how DESI was approaching the problem. From here I'll let Daniel Margala take over and tell you the rest of the story.
Daniel Margala: Okay, yes. I joined this effort a couple of months ago, right around the time the major code refactor was wrapping up, and I helped a lot with implementing tests.
Hopefully this next part will give anyone who has a Python application and is looking to leverage a GPU some good examples of how to get started. One of the first things, as I just mentioned, was implementing a lot of tests to make sure that, as we made these changes, the results were still correct, or at least the same as what the CPU version was producing.
We used some of the tools mentioned in earlier talks, like Nsight Systems, to identify the application bottlenecks and give us clues about where to focus our optimization efforts. As Laurie mentioned, our main strategy was to use CuPy as a drop-in replacement for NumPy, and in some places we also implemented Numba CUDA kernels. I'll also mention the NVIDIA Multi-Process Service, which allowed us to explore saturating GPU utilization as well.
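For context, a typical Nsight Systems invocation for a Python application looks something like the sketch below (the script name is hypothetical, not the actual DESI pipeline entry point); the resulting report can be opened in the Nsight Systems GUI to inspect the timeline:

```shell
# Hypothetical sketch: profile a Python program with Nsight Systems,
# tracing CUDA API calls and NVTX ranges into a report file.
nsys profile -o desi_profile --trace=cuda,nvtx python spectro_pipeline.py
```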
CuPy also supports the NVIDIA Tools Extension (NVTX) markers and ranges, so it's easy to add these little annotations throughout the code to label the timeline produced by the NVIDIA profiling tools, which tells us where the time is being spent while our code is running. There are a couple of different ways of doing this: you have the direct NVTX range push and pop functions, you can add Python decorators to functions so you don't have to modify any part of the function body, and there's also support for the with-statement context manager.
This was the strategy employed on the CPU: basically, to leverage a lot of parallelism by dividing the task into lots of small bits. On the GPU, with these NVTX markers, we can see that most of the time is being spent in this function, which is making a bunch of calls to a Hermitian eigenvalue decomposition, but we also noticed some unexpected performance issues during profiling.
There were some gaps that we intuitively missed when adding these markers, because we didn't expect those regions to be performance-intensive, but by looking at the profile we saw blank spaces we weren't catching. That gave us clues about some unexpected things we could speed up as well.
The main lesson here was that basic NumPy optimizations are still useful: pretty much anywhere there's a for loop, there's probably an opportunity to vectorize some part of the code. This is just one example where a really small change provided a noticeable speedup; because this function is called so many times, even a small change makes a pretty big impact.
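The DESI change itself isn't shown in the transcript, but the pattern is the familiar one; as an illustrative example (the row-norm function is my own, not from the pipeline), a per-element Python loop collapses into a single array expression:

```python
import numpy as np

def row_norms_loop(mat):
    # Loop style: one small NumPy call per row inside a Python for loop.
    out = np.empty(mat.shape[0])
    for i in range(mat.shape[0]):
        out[i] = np.sqrt(np.sum(mat[i] ** 2))
    return out

def row_norms_vectorized(mat):
    # Vectorized style: the same result in one call over the whole array.
    return np.sqrt(np.sum(mat ** 2, axis=1))

mat = np.arange(12.0).reshape(4, 3)
```

On a GPU backend like CuPy the payoff is larger still, since each loop iteration would otherwise launch its own tiny kernel.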
The other topic: because we're running a bunch of these small functions, we wanted to see if we could saturate our GPU usage with NVIDIA's Multi-Process Service (MPS), which essentially lets multiple processes share the GPU by overlapping kernel and memcpy operations. This let us increase the number of processes, dividing all those tasks among more processes making GPU calls, and it gave us better performance just by turning on something completely outside the application.
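Enabling MPS is a matter of environment setup rather than code; a hedged sketch of the usual sequence (directory paths and the 8-rank launch are arbitrary examples, and the launcher at a given site may differ) looks like:

```shell
# Hypothetical sketch: start the MPS control daemon, run several processes
# that share one GPU, then shut the daemon down afterwards.
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
nvidia-cuda-mps-control -d            # start the MPS control daemon

srun -n 8 python worker.py            # e.g. 8 ranks sharing the GPU via MPS

echo quit | nvidia-cuda-mps-control   # stop the daemon
```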
For the next part, we were trying to figure out whether there was a way to actually optimize all these calls to the Hermitian eigenvalue decomposition, and we were able to leverage a batched eigenvalue solver in the CUDA cuSolver API to remove this for loop and do all those smaller calls in just one call. There actually wasn't a wrapper for this in CuPy, but it didn't take too much work to add one.
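The cuSolver wrapper itself isn't reproduced here, but the batching idea can be sketched with NumPy's own stacked `eigvalsh`, which decomposes a whole stack of Hermitian matrices in one call instead of a Python loop (the random test matrices are my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 4, 4))
batch = a + np.swapaxes(a, -1, -2)   # a stack of 8 symmetric 4x4 matrices

# Loop style: one small decomposition per matrix.
loop_vals = np.stack([np.linalg.eigvalsh(m) for m in batch])

# Batched style: one call decomposes the entire stack at once.
batch_vals = np.linalg.eigvalsh(batch)
```

On the GPU, the batched cuSolver routine plays the same role: one launch processing many small matrices keeps the device busy.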
We looked at how the CuPy wrappers wrap other calls to cuSolver and implemented this ourselves, and this example also demonstrates how we were able to write code that works with both NumPy arrays and CuPy arrays.
CuPy implements helper functions that return the array module your actual data lives in, so you can write code that works with both NumPy arrays and CuPy arrays, which again helps with porting an application to the GPU while staying confident that the results are still the same. This is a demonstration of some of the speedups we've been able to make: by batching the call to the eigenvalue decomposition, we saw about a 1.5x speedup in a variety of different run configurations, and it also demonstrates how MPS helps improve performance as well.
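The helper Daniel refers to is `cupy.get_array_module`; a sketch of the backend-agnostic pattern (the `normalize` function is my own illustration, and the fallback is only there so the sketch runs without CuPy) looks like:

```python
import numpy as np

try:
    import cupy as cp
    get_array_module = cp.get_array_module   # returns numpy or cupy per array
except ImportError:
    def get_array_module(*arrays):
        return np                            # NumPy-only environment

def normalize(arr):
    # xp is numpy for host arrays and cupy for device arrays, so the same
    # function body serves both backends unchanged.
    xp = get_array_module(arr)
    return arr / xp.linalg.norm(arr)

v = normalize(np.array([3.0, 4.0]))
```

Because the same function runs on both backends, CPU results can be compared directly against GPU results in the test suite Daniel describes.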
Yan Zhang (session chair): Not really a question; I think it's very useful, from both of your slides, to see that there are a lot of development tools for the DESI project, and then the profiling and debugging tools. For NESAP projects at NERSC, I think it's very important that NESAP postdocs help identify the development tools for a given science project, and then the matched or best-suited profiling tools, and that part can be transferable to other NESAP projects in the future.