From YouTube: Scaling Python Applications
Description
Part of Data Day 2022, October 26-27, 2022
Please see https://www.nersc.gov/users/training/data-day/data-day-2022/ for the training agenda and presentation slides.
I'm going to talk about scaling Python applications. I'm from NERSC, where I work in the Data Analytics and Services Group, and one of the primary things I do is help our Python users use Python on the NERSC supercomputers, and in particular think about scaling challenges and using the GPUs on Perlmutter. When I started at NERSC, actually as a postdoc, I was working with DESI, porting their science code. Their science code processes data from a telescope in Arizona, and it's all implemented in Python.
The DESI pipeline code, which is all implemented in Python, scaled up to use the entire system of Perlmutter. I'll add a little asterisk there, because again there are a lot of caveats, but these slides just demonstrate that this can be done: you can move your code over to the GPUs and you can run at the entire scale of Perlmutter. So, to get started:
We're actually going to take a lot of steps back and look at a very simple problem, and use it as an example to think about parallelism in Python and consider different options. This is a very common example: using a Monte Carlo method to estimate the value of pi. On the left here we have a Python implementation of that function: we draw random samples as x and y positions, and if those x and y positions fall within the quarter circle, then we count them.
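The slide itself isn't reproduced in this transcript, so here is a minimal sketch of what that function might look like; the function and file names (estimate_pi, library.py) follow the talk's description, and the details are assumptions.

```python
# library.py: a sketch of the Monte Carlo pi estimator described above
# (illustrative; the talk's actual slide code is not reproduced here).
import random

def estimate_pi(num_samples):
    """Estimate pi by drawing points in the unit square and counting
    how many fall inside the quarter circle of radius 1."""
    count = 0
    for _ in range(num_samples):
        x = random.random()
        y = random.random()
        if x * x + y * y <= 1.0:
            count += 1
    # The quarter circle covers pi/4 of the unit square.
    return 4.0 * count / num_samples
```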
There are some terms in the next few slides that I'll probably use many times, so I just want to throw them out there. When I say a program, a program is a collection of instructions that a computer will execute. So we can think of that file there, library.py, as our program.
Here's a version of this code; this is just a single-threaded version, so there's no parallelism in here. We have our serial version, pi-serial.py, and we import our function from our library, say we want to generate 20 million samples, and run the pi estimation code. We also measure the time that it takes.
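A sketch of that driver script (the file name pi-serial.py comes from the talk; the rest is illustrative):

```python
# pi-serial.py: a sketch of the single-threaded driver described in the talk.
import time
from library import estimate_pi  # the function sketched above

num_samples = 20_000_000

start = time.time()
pi = estimate_pi(num_samples)
print(f"pi ~= {pi:.6f}, elapsed: {time.time() - start:.2f} s")
```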
If we run this simple example, it takes about three and a half seconds. What's happening here is that when we run python plus the file name, we start up the Python interpreter. The Python interpreter is the real program: it takes in our file, translates it into Python bytecode, and then those bytecode instructions get executed at runtime.
One thing to point out here is that because Python is interpreted at runtime, it's slower than compiled languages like C, C++, or Fortran. But it's a very popular language, and developers like it because they feel more productive, and it's easier to use than some of those compiled languages.
But on the bright side, people are still working on Python the language, and so Python 3.11 was released just a few days ago, and it's getting faster. I noticed in the release notes for the new Python version that they say it's about 10 to 60 percent faster than the previous version. I tested this on our simple code, and it is a lot faster; as my colleague Laurie Stephey likes to say, a free speedup is the best speedup.
Okay, so the first parallelism example that we want to look at is multi-threading. One issue with parallelism in Python is that multi-threading is not really helpful for compute-bound tasks like our simple pi estimation, and that's because of this thing called the global interpreter lock. We can't really get into the details of that here.
But this example just shows a case where we create multiple threads. Here we're creating four threads, and I give each of those threads a portion of the work to do: each thread gets a quarter of the number of samples to generate. Then we start our benchmark, we say start = time.time(), and the threads actually launch when that start method is called on each thread, while the main thread keeps going as it launches each of the other threads.
The main thread won't wait for those threads to finish until you call the join method on them. And we notice when we run this program that it's actually slightly slower than the completely serial version, and that's again because of the global interpreter lock. So it doesn't help us; multi-threading is typically not going to help you in Python.
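A minimal sketch of that threaded version (illustrative, reusing the assumed estimate_pi from above):

```python
# pi-threads.py: a sketch of the multi-threaded version from the talk.
# Because of the GIL, this is no faster than the serial version for a
# compute-bound function like estimate_pi.
import time
import threading
from library import estimate_pi

num_samples = 20_000_000
num_threads = 4

# Each thread gets a quarter of the samples to generate.
threads = [threading.Thread(target=estimate_pi,
                            args=(num_samples // num_threads,))
           for _ in range(num_threads)]

start = time.time()
for t in threads:
    t.start()   # launch; the main thread keeps going
for t in threads:
    t.join()    # wait here for each thread to finish
print(f"elapsed: {time.time() - start:.2f} s")
```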
There are some cases, though, where it does help. If you have non-compute-bound things, things like I/O, if you're waiting on file system I/O operations or something like that, multi-threading can help. This is just a quick example showing a case where multi-threading can actually help, but most people aren't just calling sleep in their scientific data processing code.
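A sketch of that I/O-flavored case (illustrative), where the threads spend their time waiting rather than computing, so they genuinely overlap:

```python
# Four threads each "wait" for one second; because sleeping releases the
# GIL, the total elapsed time is about 1 second, not 4.
import time
import threading

threads = [threading.Thread(target=time.sleep, args=(1,)) for _ in range(4)]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"elapsed: {time.time() - start:.2f} s")  # ~1 s
```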
For things like web servers, there are plenty of valid use cases out there in the wild where multi-threading is helpful. And looking further, beyond the current release of Python, I noticed that one of the goals for the next version of Python is actually developing some work around this multi-threaded parallelism.
In multiprocessing, we bypass the GIL, the global interpreter lock, by spawning separate processes, and those processes can run in parallel and make progress. So here I have a simple example, again a version of our program, where we start up four new processes using the multiprocessing Pool and then again pass each of them a quarter of the work.
And now we do see a good speedup here: not quite a factor of four, but close to a factor of four. I also just wanted to highlight that the way those processes start up can vary. I'm demonstrating it using the spawn start method, which is not the default method on Linux systems, because it's a little more composable with MPI on HPC systems.
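A sketch of that multiprocessing version (illustrative; the explicit spawn context and the quarter-of-the-work split follow the talk):

```python
# pi-multiprocessing.py: a sketch of the multiprocessing version.
import time
import multiprocessing as mp
from library import estimate_pi

if __name__ == "__main__":
    num_samples = 20_000_000
    num_procs = 4

    # "spawn" is not the Linux default ("fork"), but composes better
    # with MPI on HPC systems.
    ctx = mp.get_context("spawn")

    start = time.time()
    with ctx.Pool(num_procs) as pool:
        # Each process estimates pi from a quarter of the samples.
        parts = pool.map(estimate_pi, [num_samples // num_procs] * num_procs)
    print(f"pi ~= {sum(parts) / num_procs:.6f}")
    print(f"elapsed: {time.time() - start:.2f} s")
```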
Speaking of MPI: MPI stands for the Message Passing Interface, and it's really just a standard which defines a set of library functions that facilitate inter-process communication. One way I was kind of cheating in those last two examples is that I didn't really collect the results from each of those separate processes or threads and try to combine them; I did something very simple,
just telling each thread or process how much work to do. I didn't pass a lot of data around. MPI gives the user a common set of functions that they can use for sharing data between processes, and in Python we can use mpi4py, which builds on top of that specification and provides an interface where you can pass picklable Python objects and things like NumPy arrays to those collective and point-to-point communication functions.
So here's an example using mpi4py in Python. One thing that's different about this: if you notice our execution command here, we have srun -n 4 python.
Now we have something external to the Python interpreter that we use to launch our program. That launcher launches four processes, and those processes sync up during the MPI initialization, once each of them is executing. Here that happens at the line from mpi4py import MPI: all of those processes sync up and figure out how they're going to communicate with each other.
So now you have this communicator object which you can use. Here again we're not doing very much, it's a pretty simple example, but those comm.Barrier() calls are making sure each of those processes is in sync before they move on to the next instructions in their process.
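A sketch of that mpi4py version (illustrative; launched with something like srun -n 4 python pi-mpi.py, and the final reduce is my addition to show the communication functions mentioned above):

```python
# pi-mpi.py: a sketch of the mpi4py version. MPI initialization happens
# implicitly on "from mpi4py import MPI".
import time
from mpi4py import MPI
from library import estimate_pi

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

num_samples = 20_000_000

comm.Barrier()                    # all ranks in sync before timing starts
start = time.time()
local = estimate_pi(num_samples // size)
comm.Barrier()                    # all ranks done before timing stops
elapsed = time.time() - start

# This time, actually combine the per-rank estimates on rank 0.
total = comm.reduce(local, op=MPI.SUM, root=0)
if rank == 0:
    print(f"pi ~= {total / size:.6f}, elapsed: {elapsed:.2f} s")
```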
Another very popular parallelism framework for Python is Dask. Dask is a very popular tool in the Python community, and I won't go into a lot of details here; I just wanted to share it because it's popular. Another nice thing about it, which I didn't mention about MPI: MPI gives you a way of scaling out not just within the server or the node, but beyond,
using multi-node parallelism, and Dask also gives you a good way to scale out to multiple nodes as well. There's also a lot of documentation with examples: there are a lot of different ways to use Dask, and the documentation is pretty good, with a lot of examples and tips for performance.
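The talk doesn't show a Dask slide here, but as a minimal hedged sketch, the same pi estimate could be phrased with dask.delayed, where swapping the scheduler (or pointing at a dask.distributed cluster) changes where the tasks run:

```python
# A minimal Dask sketch (not from the talk's slides): four delayed tasks,
# each estimating pi from a quarter of the samples.
import dask
from library import estimate_pi

num_samples = 20_000_000
num_tasks = 4

tasks = [dask.delayed(estimate_pi)(num_samples // num_tasks)
         for _ in range(num_tasks)]
# scheduler="processes" runs locally; a dask.distributed Client would
# scale the same task graph out across multiple nodes.
parts = dask.compute(*tasks, scheduler="processes")
print(f"pi ~= {sum(parts) / num_tasks:.6f}")
```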
Another thing you should strongly consider, and you probably already are if you're doing science with Python, is array programming. I wanted to call it out because it really is the foundation of a lot of scientific data processing with Python. I'll just show the next example here, where I'm demonstrating what array programming is.
Here's a version of our simple example using array programming. I've redefined our estimate-pi function here, but now, instead of using a for loop, we're creating an array of random uniform samples, and then, instead of looping through each of those samples, we perform array operations on them using the NumPy API. This gives us a way to bypass that global interpreter lock, because NumPy is built on top of C and Fortran libraries.
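A sketch of that array-programming version (illustrative):

```python
# A vectorized estimate_pi: the loop over samples becomes whole-array
# operations that execute in NumPy's compiled C/Fortran code.
import numpy as np

def estimate_pi(num_samples):
    x = np.random.uniform(size=num_samples)
    y = np.random.uniform(size=num_samples)
    # One vectorized comparison counts points inside the quarter circle.
    count = np.count_nonzero(x * x + y * y <= 1.0)
    return 4.0 * count / num_samples
```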
When you call these vectorized array operations, you're actually calling into code that's implemented in a lower-level language like C or C++, and so you get C-like or Fortran-like performance when you use NumPy. This is incredibly useful, and you can see this version is actually faster than all of the previous examples.
But this also adds a little bit of a challenge, because now we actually have multiple levels of parallelism to think about. Here's a different example using NumPy, where we're creating a thousand-by-a-thousand random matrix, turning it into a symmetric positive-definite matrix, and then I just want to benchmark
this eigenvalue decomposition function in the NumPy API. At the bottom here I'm just using a Python module that helps with some benchmarking; it's not that big of a deal, but it takes about half a second to perform this eigenvalue decomposition on this thousand-by-a-thousand matrix. The NumPy code is calling into this lower-level LAPACK back end, which is typically something like OpenBLAS or Intel's MKL, and those libraries are multi-threaded.
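A sketch of that benchmark (illustrative; using the standard-library timeit module in place of whatever benchmarking helper the slide used):

```python
import timeit
import numpy as np

n = 1000
a = np.random.rand(n, n)
# Make the random matrix symmetric positive definite.
spd = a @ a.T + n * np.eye(n)

# np.linalg.eigh dispatches to the multi-threaded LAPACK back end
# (e.g. OpenBLAS or Intel MKL).
t = timeit.timeit(lambda: np.linalg.eigh(spd), number=5) / 5
print(f"eigh on a {n}x{n} matrix: {t:.3f} s per call")
```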
But the OpenMP runtime that's used by the back end in NumPy chooses at runtime how many threads it should use. So this is something to think about when you're composing parallelism in Python: how many resources you're using, and how composable those different layers of parallelism are.
Here I highlight in orange that the optimal number of threads for that runtime library to use for this operation is not the default value. What I'm doing here is a scaling study where I explicitly limit how many threads the OpenMP runtime should use and then run the benchmark to measure the performance. This lets me build an intuition for the optimal number of threads I should specify for that piece of code. And I'll just point out that, by default,
the OpenMP runtime will typically choose one thread per core, so on a Perlmutter CPU node, for example, that would be 128 threads, and that gives us a value close to what I showed on the previous slide.
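One way to run that kind of scaling study (a hedged sketch: the talk likely limited threads via an environment variable such as OMP_NUM_THREADS, while this version uses the threadpoolctl package to apply the limit from inside Python):

```python
# Thread-scaling study for the eigh benchmark (illustrative).
import timeit
import numpy as np
from threadpoolctl import threadpool_limits

n = 1000
a = np.random.rand(n, n)
spd = a @ a.T + n * np.eye(n)

for nthreads in (1, 2, 4, 8, 16, 32, 64, 128):
    # Cap the number of threads the BLAS/LAPACK OpenMP runtime may use.
    with threadpool_limits(limits=nthreads):
        t = timeit.timeit(lambda: np.linalg.eigh(spd), number=3) / 3
    print(f"{nthreads:4d} threads: {t:.3f} s per call")
```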
That type of scaling measurement is a powerful tool for understanding performance. We started off with a single-threaded example, and then I showed a couple of different ways of doing some sort of parallelism, multi-threading or multiprocessing, but I showed everything with an example using four threads or four processes.
Any time you want to understand the performance of your code, it's good to do what we refer to as a scaling analysis, where you vary the number of processes or threads and look at the behavior. In this case I'm showing an example from code that I've worked on moving over to the GPU, and the blue line
shows the original CPU implementation. I measure the runtime of the whole program as I increase the number of tasks, and it goes down almost perfectly for a little bit, but then it starts to flatten out. That's typically what we see: as you increase the number of processes, there's usually some overhead or communication or something that slows down performance, so you don't keep getting these perfect speedups.
I think this is a really powerful tool while you're developing or moving things over to the GPU or something like that, because when you make changes to your code, you might change the performance at different scales of parallelism. This is just an example illustrating that. And here's another example, where I'm capturing not just the total execution time of the program but also measuring the specific steps that make up that program, and I just want to highlight
one thing here: the black Total line. It goes down for a little bit, bottoms out around 32 or 64 tasks, and then at 128 it starts to tick back up. And you can see there's one line that's really increasing throughout that whole time, and that's the import section. So here's an example where the import step in Python, where you're just importing all the libraries you're going to need at the top of your program, takes longer and longer as you add more and more tasks.
Okay, so I'm going to switch gears a little bit, because I want to cover using GPUs as well. I mentioned GPUs, and I'll also point out that just out of the box, you can't use the GPUs with NumPy or SciPy; they're not set up to do any sort of computation on the GPU.
Some libraries give you a drop-in replacement for NumPy or SciPy, or for pandas or scikit-learn. Something called CuPy gives you the NumPy API but lets you use arrays that are stored in GPU memory, and RAPIDS provides things like pandas and scikit-learn but runs that stuff on the GPU. There are also the machine learning libraries, and a lot of those do more than just machine learning: they also provide array-like APIs and support general GPU computing.
That's things like PyTorch, TensorFlow, and JAX, and we have some talks later this afternoon about those. And if you want to get into more lower-level GPU programming, there are a lot of options too: Numba, PyOpenCL, PyCUDA, and CUDA Python also give you ways to dig a little deeper.
If you want to use more than one GPU, all four of those on a node, it's a similar challenge to scaling out to multiple nodes in Python: you typically need to combine some distributed-memory parallelism with your GPU library of choice. In the work that I've done, I've used MPI plus CuPy, for example, for multi-GPU, multi-node programming.
You can also achieve something like that with Dask, and even with multiprocessing, but with multiprocessing it's a little bit more work. And then there are other options that are maybe a little more experimental, like cuNumeric, but it could be something to look at and keep in mind for the future.
I'll also just point out here at the bottom that I'm demonstrating that there are so many of these frameworks that it's almost a little messy trying to compose them or mix and match. But there is a recognition of this issue, and the community is trying to standardize around a common Python array API.
Just to give you a little example of that: you can combine writing a low-level CUDA kernel using numba.cuda, that's on the left, with using the CuPy API on the right to create an array on the GPU device.
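A sketch of that interoperability (illustrative): CuPy arrays expose the CUDA array interface, so a Numba CUDA kernel can operate on them directly.

```python
# A Numba CUDA kernel applied to an array allocated with CuPy.
import cupy as cp
from numba import cuda

@cuda.jit
def double(x):
    i = cuda.grid(1)            # global thread index
    if i < x.size:
        x[i] *= 2.0

a = cp.ones(1_000_000)          # array lives in GPU memory
threads_per_block = 128
blocks = (a.size + threads_per_block - 1) // threads_per_block
double[blocks, threads_per_block](a)   # works via __cuda_array_interface__
print(a[:4])                    # -> [2. 2. 2. 2.]
```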
Then, when you're thinking about which code you should move over to the GPU: right now, I imagine you already have a code or application that's running on the CPU. You don't want to just move the whole thing over to the GPU in one go. You want to understand where the performance bottlenecks of your current application are, and then figure out if you can get a speedup or some performance benefit by moving those over to the GPU.
You don't want to move everything over, because there's an overhead to launching GPU kernels. Here on the right I have an example where we're just doing a dot product, really a matrix multiplication of two-dimensional arrays. I have this xp.random because I'm using either the NumPy API or the CuPy API. In blue, you can see that for small matrix sizes NumPy is very fast;
it's a very fast operation until about a size of 20 or 30 or so, and then it jumps up to taking about a millisecond or beyond as we keep increasing the scale. If you compare that to CuPy, you can see that its performance is relatively flat, and it stays flat for a lot longer than the NumPy version. So beyond a matrix size of about 20 or 30 or whatever it is, it's actually beneficial to do this operation on the GPU.
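A sketch of that xp pattern (illustrative): binding either module to the name xp lets the same code run on the CPU or the GPU.

```python
# The same matrix multiplication on CPU (NumPy) or GPU (CuPy).
import numpy
import cupy

for xp in (numpy, cupy):
    a = xp.random.rand(1000, 1000)
    b = xp.random.rand(1000, 1000)
    c = xp.dot(a, b)   # CPU BLAS call or GPU kernel, depending on xp
    # Note: timing the GPU version fairly would also require a device
    # synchronization, e.g. cupy.cuda.Device().synchronize().
    print(xp.__name__, c.shape)
```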
So if your algorithm or your code is working with large arrays or matrices or images, there's a good chance that you could see a speedup on the GPU. And another thing: if your application is already I/O bound, if that's already the bottleneck, then moving to the GPU is not going to fix your I/O issue.
And then just a thing to keep in mind: CPUs are great, they're super fast for doing a few things in parallel. GPUs are maybe a bit slower for a single thread, but they have so many threads that you get higher throughput with the GPUs.
Just to wrap up: I think one of the most powerful things you can do to improve the performance of your code is to really learn and become an expert at array programming with NumPy, and eliminate as many for loops in your program as possible using the vectorization, broadcasting, and indexing features of NumPy. I've helped a number of users
with, you know, how do we move stuff over to the GPU, and I've seen so many times that even just removing Python for loops and using the NumPy API is already a huge benefit. And then with things like CuPy, which lets you use the same exact API but now on the GPU, you get almost all of that work of moving over to the GPU for free. Another thing to keep in mind when you start running at scale
is that Python is a file-system-intensive language, and we see this a lot: as you increase your process count and the number of nodes, that file system startup becomes an issue.
For this weak-scaling study, where I ran the DESI pipeline all the way up to 1500 nodes on Perlmutter, you can see that the performance was pretty flat all the way out to about 300-ish nodes, and then it starts to pick up. When I ran this, I was running 32 tasks per node, so this is starting up tens of thousands of processes at the same time, and I did not actually use a container for this.
At the time, I was able to get pretty far without having to do that, but that tick-up at the end really is just due to the startup time. Beyond 300-ish nodes, the Python startup is taking, you know, 15 minutes just for the application to start.