From YouTube: 7. Introduction to Dask + GPUs
Description
From the NERSC NVIDIA RAPIDS Workshop on April 14, 2020. Please see https://www.nersc.gov/users/training/events/rapids-hackathon/ for all course materials.
So for those who have done the notebooks in advance, this will look fairly familiar, but there will be one difference: as you can see right at the top here, I'm not running this on the Cori system. I'm going to be running this on one of our internal servers at NVIDIA, and as a result I'm also going to show and highlight how this works not just with a single GPU, but with multiple GPUs, so you can get a visual sense of that.
That should help if you did the notebooks but don't currently have access to multiple GPUs. So anyway, diving in: as Vibhu mentioned, Dask is a flexible library for parallel computing, and it makes scaling out easy. We've put a lot of work into supporting the Dask community and contributing to Dask development for GPUs, in particular support for cuDF and enhanced support for CuPy arrays.
There are a couple of things I want to preface this with. Dask is great: it can scale up and scale out to many machines, but Dask does introduce a small amount of overhead. Any distributed computing framework will introduce overhead if your workload fits on a single machine; that's just the nature of distributed computing. It has to have overhead, and Dask is really efficient about it.
So it's great, but if your workflow is fast enough on a single GPU, or your data comfortably fits in memory on a single GPU, you wouldn't want to use Dask unless you expected it to scale; you would want to just stay with the single-machine libraries, cuDF or CuPy. That applies to both CPUs and GPUs, and the same applies to using pandas and NumPy. With that said, there's a little bit of a benefit that doesn't come through on the GPU in the same way: when you use Dask with pandas or NumPy on your laptop, you get to use all of your cores, something that might not already be happening, and that particular benefit doesn't carry over to the GPU in the same way.
I'm going to create a Dask cluster with a couple of commands. There are some things here that are Dask-CUDA specific; Dask-CUDA is the set of add-ons that allow Dask to work well with GPUs, which we've been working on upstreaming, and some of them have been upstreamed. So we're going to import this LocalCUDACluster, which Vibhu alluded to in the presentation. He mentioned how the general pattern for Dask is that you create a cluster and you scale it up, and that is the general pattern. In this case, though, we don't need to scale it up, because it's just going to use the entire local machine unless we tell it to only use GPU zero. We're also going to use a Client from distributed; this client is how we connect to and interact with our cluster's scheduler. So I'm going to fire this up right now. I'm going to use a single GPU, and note that I'm also setting a memory limit here. This is a memory limit for the GPU, which means Dask is going to target keeping memory below 4 gigabytes on this GPU.
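A minimal sketch of what that cell might look like, assuming Dask-CUDA is installed (the exact arguments in the course notebook may differ):

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# One worker pinned to GPU 0, with a soft 4 GB device-memory target:
# going past it causes spilling rather than failure.
cluster = LocalCUDACluster(
    CUDA_VISIBLE_DEVICES="0",
    device_memory_limit="4GB",
)
client = Client(cluster)  # connect to the cluster's scheduler
```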
Now, if work is happening and it goes above 4 gigabytes, that work won't be eliminated. What will happen is that data will be spilled to host (CPU) memory, or other work will be spilled to make room for it, and then it'll be brought back onto the GPU as appropriate. So you can see here that I have a client. In the notebooks, this configuration on yours has not been commented out; it's there to help you access the dashboard.
This is the status page, and by default it's showing me that a little bit of activity is happening. In this case I don't have much going on on the GPUs, but it could show that I had memory already allocated if I had been doing other work, or if someone else had been using the machine; it would tell me how much memory is being allocated on the GPUs.
In this case the cluster is running and there's one worker attached to it. As Vibhu mentioned, we operate with a one-worker-per-GPU model, and I've assigned it to use one GPU, so I have one worker. We can see that in the Workers tab, where I have one worker, and this gives me metrics about CPU utilization, what's going on, and all sorts of things like that. I can also look at a task stream, and not just the running task stream; this is going to be a live version.
So we'll create some random data. In this case we're going to create a distributed array using the Dask array RandomState. We're going to do this on the GPU, so we can create a random GPU array by using CuPy here. If we wanted a CPU array, we could call it with NumPy (I'd have to import NumPy, but we could do the same thing). You'll notice here that with this generator I can create a fairly large array of 100,000 by 1,000, and I'm choosing the chunk size.
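A sketch of that cell; the key trick is handing Dask's RandomState the cupy.random.RandomState class so every chunk is generated as a CuPy array on the GPU:

```python
import cupy
import dask.array as da

# Swap in numpy.random.RandomState instead to get a CPU-backed array.
rs = da.random.RandomState(RandomState=cupy.random.RandomState)
x = rs.random_sample(size=(100_000, 1_000), chunks=(1_000, 1_000))
```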
It takes a call to persist for us to actually execute this, just like Apache Spark: it's lazy execution, which is very common in parallel processing, because it lets us do optimizations once we know what the full task graph is. So let's run this.
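The cell itself is essentially one line, continuing the sketch above:

```python
x = x.persist()  # kick off execution of the 100 chunk tasks on the cluster
```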
This is going to do what we just said: it's just going to create some random data. Now let's take a look. So what happened? Our GPU ran the code, and it's already finished, so we don't get to see it really screaming.
It took, you know, half a second, and you can see it; I can zoom in if it's not big enough. So let me zoom in. This is the random_sample task: there were 100 tasks, all of them succeeded, they're all done. And now a couple of things have happened: we've got a task history, and we should have a profile, which tells us where time is being spent. So we can see that, as expected, we spent our time doing random_sample; no surprises there.
We had a hundred tasks because this 100,000 by 1,000 array, in chunks of 1,000 by 1,000, is naturally going to have 100 chunks, and we can see that right here. Dask provides a really nice string representation that actually uses HTML in notebook cells to present this, and you can see that it gives you the shape of the array. This is a tall and skinny array, and the array is 800 megabytes.
Now, that visual representation is great, but let's do some work. Let's actually schedule some work with the same operation Vibhu mentioned in the slides: singular value decomposition, which is a matrix decomposition. That ran instantly because it didn't actually do the compute. What we've done is schedule work, and we can see this very explicitly: notice that we now have 708 tasks to do on these objects, whereas before we only had a hundred tasks.
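Scheduling the SVD is a single lazy call, continuing the sketch above:

```python
# Lazy: adds the SVD tasks to the graph but does no GPU work yet.
u, s, v = da.linalg.svd(x)
```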
Dask maintains a task per chunk or per partition, and this makes sense: it has to organize and orchestrate them. So this actually added 608 tasks to our graph, but we haven't computed anything yet. So let's do something: we'll call persist to run all of these at once, and then we'll use this wait command.
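A sketch of the persist-and-wait pattern, assuming the names above:

```python
import dask
from dask.distributed import wait

# Persist all three results in one call so shared intermediates are
# computed once, then block until the asynchronous work has finished.
u, s, v = dask.persist(u, s, v)
_ = wait([u, s, v])
```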
Now this is optional, but sometimes we want to wait for the results of all of these asynchronous operations before we go forward, so we call wait. It's not actually necessary; it's a nice convenience function, at least at this point. Sometimes it's important for workflows, but in this case it's more of a convenience. So I'm going to launch this and go back to the scheduler dashboard page. A lot of stuff is happening; a lot of stuff just happened. Let's take a look.
There were a bunch of different operations that we had to do in order to make this singular value decomposition actually happen. We called SVD, we called dot products, and we had a QR decomposition. Why do we have a QR decomposition? Well, it turns out that a distributed algorithm for SVD does rely on QR decompositions; that goes into the internals of Dask, but we can see each of these tasks, and we can see in our task stream that they all took different amounts of time.
And so on and so forth. I can then just reset with a single click, which is very convenient, and the profile, of course, was updated too. Now time has been spent in multiple places: the time we spent on the random sampling is now much less than the time we spent doing things like the QR decomposition from the linear algebra library, or the actual SVD array function, and so on and so forth. So that's sort of how Dask works. But now the results are still distributed.
We can't really look at these results directly. If I look at u, it's a distributed array; there are no more tasks waiting to complete, and we're back to the 100 tasks for the 100 chunks, but I can't see anything. Well, I can call compute right now to make this array essentially one partition and just grab the underlying CuPy array. In this case I'm just going to slice in and grab a 5x5 slice, and there it is. This is my actual CuPy array.
That is these 25 values, and you can see that if I call type(), it's actually my CuPy array. Dask manages this, and it knows what to do to give me that result. This tiny compute call right here was my getitem call: I was accessing those elements from the distributed array. And that's really all there is to it.
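A sketch of pulling a small slice back as a concrete CuPy array, continuing from the persisted u above:

```python
u_small = u[:5, :5].compute()  # getitem on the distributed array, then compute
type(u_small)                  # cupy.ndarray: the underlying GPU data
```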
We can do the same thing now, not with arrays but with dataframes, and we'll go through a couple of other dataframe examples showing some fairly complex operations. This is going to generate some random dataframe data, and you'll see that we're calling functions that are executed later: this right here didn't actually do any compute until I called head.
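The notebook generates this data for you; a hypothetical stand-in with the shape described below (two and a half million rows, 30 partitions) might look like:

```python
import cupy
import cudf
import dask_cudf

# Hypothetical stand-in for the notebook's generated data:
# an integer key column and a float value column.
n = 2_500_000
gdf = cudf.DataFrame({
    "id": cupy.random.randint(0, 1_000, size=n),
    "x": cupy.random.normal(size=n),
})
ddf = dask_cudf.from_cudf(gdf, npartitions=30)
ddf.head()  # eager: triggers computation of the first rows
```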
Why did that happen? Getting the beginning, in this case the first few rows of a distributed dataframe, is not a lazy operation. When I call head on this object, it becomes eager, and in order to actually give me these values it had to do computation. That's why we saw this computation actually happen: this purple bar right here, where I called head, actually triggered the data generation itself. So hopefully this is beginning to make sense. Most operations are lazy, but we can explicitly force computation with persist, and we can force computation by calling head to inspect things. It's nice that we can do things lazily, because it lets us optimize.
To make a more complicated example, we can take this dataframe, which we can see how long it is. It's actually a fairly large dataframe, and we can call len on it, and we get a bunch of length computations: there's one task for each of the partitions. In this case there were 60 partitions, which we can see right here; actually, sorry, there were 30 partitions, which we can also get right here.
With 30 partitions, when we call len we get 30 tasks, and we see that there are two and a half million rows. So let's do a groupby. This is a fairly large groupby, and notice, most importantly, that it's the same API as we saw in the cuDF notebooks; these APIs don't change at all. Also notice that I'm calling head here, which is going to force the computation to actually happen, so we'll watch the dashboard while it's happening. You see that we're doing aggregations across different chunks; we're doing lots of operations.
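A hedged sketch of that kind of groupby, using the hypothetical columns from the stand-in above (the notebook's column names and aggregations may differ):

```python
# Multi-aggregation groupby on the GPU; the same API as pandas/cuDF.
res = ddf.groupby("id").agg({"x": ["mean", "std", "count"]})
res.head()  # forces the computation and pulls back the first few rows
```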
Okay, maybe I have to zoom in a little more. Okay. So this is an example of how using the profiler can be really valuable. We saw the tasks happen, but the tasks have gone away, and now our profile is a little complex. If we want to understand how long we're spending in this workflow, where the groupby time is spent, we can dig into these profiles and see.
Oh wait, okay: I called groupby, and that aggregation took 1.56 seconds to do all the steps it needed to do to make that happen. There we go, there's our answer. And of course we had our results fairly quickly because this is on a GPU: we can do hierarchical, multi-column groupbys with multiple aggregations on millions of rows in about a second, which is great. But every time we run this, we're creating the data, right?
Well, in theory it runs quite a bit faster. I didn't measure the aggregation on its own, but it would have been faster, because it didn't have to wait for the data generation. Of course, the results are going to be the same, and it's the same API as cuDF but working across multiple GPUs. We're doing it with one GPU here, but in a second I'll show it with a lot of GPUs. Before we do that, I want to highlight another example of functionality that's fairly complex but is supported as well: rolling windows.
We saw an example of that very briefly in the cuDF notebook with the user-defined functions. We can do rolling window operations in Dask as well, and just like before, it's a lazy operation: if I don't call head here, it's not actually going to execute. But I will call head so we can see it, and in this case it's incredibly fast, because this is a very efficient operation.
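A sketch of a rolling window on the same hypothetical dataframe:

```python
# Lazy rolling mean over a 3-row window; nothing runs until head() is called.
rolled = ddf["x"].rolling(3).mean()
rolled.head()
```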
Question: so, no, Dask will not do that; it's a design choice. In this case I've already persisted this ddf, which is why this was particularly fast, but Dask will not cache that computation if I'm doing it in separate cells. If I did this all within one graph in a Python script, Dask would actually not recompute the calculation, but if I'm explicitly calling persist and compute and things like that, it's not going to cache the results in between. That's a design choice.
Question: because we're using Dask-cuDF (sorry, because we're using Dask-CUDA and cuDF) and we call persist, we're explicitly putting data in GPU memory. And actually, that question is a great segue, because the next thing I'm going to do is show that we've used up GPU memory. This dataframe, this ddf, as well as the arrays we created above, are using this much data in GPU memory; I'd say about six gigabytes.
That was my estimate from before; perhaps I have another process running that has about a gigabyte in use. Roughly speaking, when we run operations with cuDF, everything is being persisted into GPU memory by default. In order to use CPU memory, we have to explicitly spill to CPU memory, which is something we enabled in the beginning, and is actually what I'm going to show right now.
Great question. I chose 4 gigabytes arbitrarily, and there are a couple of things in play here. What Dask is using to assess when it should spill is multifaceted. At the high level, it's not using the total available GPU memory to decide "oh, I only have twenty-seven gigabytes left, therefore I should spill"; it's instead using the size of the objects that it has visibility into in memory. So we've done a bunch of compute here, and Dask has visibility into the things we're doing.
Data that's not in memory from this workflow but is on the GPU is not visible to Dask, so it's not using that information when thinking about spilling. That's the high level. A little more nitty-gritty: Dask is going to schedule compute in a way that it thinks it can execute efficiently, and so Dask will go over the memory limit that we set as long as it thinks it would not be efficient to spill.
So right now, the next thing I was going to show is about spilling. You can see in this case there are about seven gigs in use on the GPU, and I think we've used about six of them, so we should start spilling if we do more operations that are very compute-intensive, such as this one right here. Notice that this operation is very similar to the one before, except it's bigger.
Instead of creating a 100,000 by 1,000 array, we're going to create a 500,000 by 1,000 array with larger chunks. Again, right here we've not actually created the array; we've just made the tasks to create the array. When we run this, we'll see a couple of things: I'm going to switch to the dashboard, and you should see a few examples of things that look like disk reads or disk writes.
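A sketch of the larger job, reusing rs and the imports from the sketches above; the chunk size here is illustrative. At roughly 4 GB of float64 data plus intermediates, it pushes past our 4 GB device-memory target:

```python
# 500,000 x 1,000 float64 is ~4 GB, so working memory exceeds the soft
# target and Dask-CUDA spills chunks to host memory (or disk) and back.
x_big = rs.random_sample(size=(500_000, 1_000), chunks=(10_000, 1_000))
u, s, v = dask.persist(*da.linalg.svd(x_big))
```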
Those disk reads and writes are going to be examples of when we're spilling, so I'm going to run this right now and go back to the dashboard. This is a more complex workflow, but you'll notice that suddenly there are these yellowish, kind of goldish-tan bars; I'll go with gold. These gold bars are our examples of when we're spilling. We were scheduling work to be executed that was too intensive for our soft target of four gigabytes, and so the scheduler said:
"Okay, I need to do this computation, so I have to temporarily spill some of the memory that I'm holding into CPU memory, or perhaps to disk, depending on how we configure it." In this case it spilled to disk, and then, once it's done, it's going to read that back. So you see right here there are a bunch of different things happening. I'm just going to quickly zoom in on a portion of it: you notice that, okay, I'm doing a dot product, and so I need to do a write step.
Now I'm going to use a way larger value, so you can see a significant difference in compute time compared with one GPU. You could have done all of this on a single machine or a single GPU and it would have been fine, but now, instead, I'm going to use all the GPUs and demonstrate that we can do a very complex calculation. You can see that I actually have 16 GPUs in this machine; this is a DGX-2.
It's got 16 32-gigabyte GPUs that are connected with NVLink, and all of them are actually connected through an NVSwitch. I'm going to create a cluster using all of them. I'm not going to use a memory limit, I'm just going to ignore that, and I'm going to set a scratch space directory, which is just good practice. This is not necessary, but I'm going to do it to be nice to this machine: I don't want scratch space to be on a shared file system.
So I'm going to go to the Workers tab, and you'll see that it's going to spin up 16 workers; we'll eventually see the number 16 and all the different workers coming up. I'm going to skip this right here and just go all the way back down to the very bottom, where we do this large SVD again. Actually, I have to import the libraries; actually, I think we're good. I'll skip down to this once it's ready, and you'll see that we've got 16 GPUs here.
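A sketch of the large SVD; the shape is an inference, since 500,000 by 10,000 in float64 is the 40 gigabytes mentioned below:

```python
import cupy
import dask
import dask.array as da

rs = da.random.RandomState(RandomState=cupy.random.RandomState)

# ~40 GB of float64: too big for one GPU, comfortable across 16.
x = rs.random_sample(size=(500_000, 10_000), chunks=(10_000, 10_000))
u, s, v = dask.persist(*da.linalg.svd(x))
```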
You'll see here this is a large array. It's not enormous, since I have a lot of GPUs, but it's still 40 gigabytes. That's too much data for a single GPU right now, unless you're on one of the very, very large Quadro GPUs, and these are big chunks. So let's see what happens if I actually run this; we'll get a sense of what's really going on here. You can see that we've got a lot more tasks.
We have 700 tasks for this operation. Dask is scheduling these tasks, then it's going to run them, and it's going to run them across the 16 GPUs. Actually, this one is too fast to watch closely, but you can see it did this whole thing in about 10 to 15 seconds. Notice that there were transfers at the end. This is exactly what Vibhu was mentioning when he was showing that example of the Unified Communication X protocol, UCX.
With UCX, these transfers can be much faster. But you can see that we were able to use all of the GPUs to calculate a very, very large matrix decomposition in about 10 to 15 seconds, which is great, and if we wanted the results, we could just grab them right there. The same code runs on one GPU or on 16 GPUs, and the same code goes multi-node
if we had multiple of these machines. That's the power of Dask. So hopefully this has been a good introduction for those of you who had a chance to go through the notebooks, and hopefully it was nice to see it running on multiple GPUs. There's also a lot of value in the profile; again, it's out of scope and we don't have enough time to go in depth on the profile here, but in general, the profile is the first place to look when you're looking at workflows with Dask and trying to understand where time is being spent.
Where should I spend my time optimizing things? Where am I perhaps doing things inefficiently? In this case, if all of us developers came together and said that the most important thing we can do in the next Dask release is to make our SVD computation better and more efficient, and that was our goal, the first thing we'd want to do is understand where time is being spent. And the SVD computation is all happening right here.
I hope you can read this; I apologize if not. It's happening in these dot products and these wrapped QR decompositions, and it looks like about 20% of the total time in this workflow was spent in one and 38 percent in the other. The rest was on data generation; let's not worry about that. It's pretty clear that we spent twice as much time doing the QR decomposition
as anything else in the workflow. And it turns out that a lot of the time we spent in the QR decomposition was not just the actual QR operation; it was spent doing other things, perhaps serializing and deserializing data and things like that. These are things we can look at in different parts of the profile, in particular in the administrative profile, which I think is more of an advanced usage, so I'm not going to go into it now, but I'm happy to take questions and talk about that later.
One last thing: Dask provides the task graph, and this was the task graph. Notice that, as Vibhu mentioned earlier, when a task is released from memory it goes blue, and when it's held in memory it's red. In this case we only have that final result in memory, because we finished our tasks. But if I were to run this again, you could actually see this update live, and we'd see how things are going. So this one is being held, presumably because this is the stage where we're communicating results, and okay, now it's finished. But that's the concept.
Question: actually, let me add that I shouldn't have gotten rid of Chrome. We actually don't have a break coming up anyway, so that's okay. But the answer to that question is: very little.
So Dask right now has 16 workers because I set up a cluster that has 16 workers. In this case I could instead set up a cluster that has 8 workers, and I can do this manually by typing out the GPUs I want: maybe I don't want GPU 3, maybe I only want GPUs 0, 1, 2, and then 7, 8, 9, etcetera. This is how you would set that up.
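A sketch, with illustrative device indices:

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# Only these six GPUs get workers; the device list is illustrative.
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1,2,7,8,9")
client = Client(cluster)
```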
Good question. Currently, LocalCUDACluster is going to use the CPUs available for spilling, but it's not going to use the CPUs for compute. So if you're thinking about a world in which we have these eight GPUs and we have various CPUs as well, this single cluster is not going to be able to use, for example, the 40 CPU threads as well as the eight GPUs for compute; that's a more advanced setup beyond this baseline.
Good question. Perhaps there's a memory error because there was slightly more data than could fit in GPU memory, or the computation spiked memory slightly more than expected, or the GPU was shared, or there were already objects in memory. I can free this memory explicitly. I'm working in Jupyter notebooks here, and many of you have probably experienced this: it's a little bit more difficult to free memory, because Jupyter holds onto references. But in general, I can free the memory associated with this.
Sorry, I killed my kernel, but I can free the memory associated with this u array, which we know is fairly large, or the x array that was 40 gigabytes, by simply calling del x, just like you normally would. This will trigger garbage collection for Python and for Dask. Now, because it's Jupyter, you might actually also need to delete some of the hanging references, but in general, that's how you'd do it.
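A sketch of freeing that memory:

```python
# Dropping the last references lets Python garbage-collect the objects and
# lets Dask release the corresponding chunks of GPU memory on the workers.
del x, u, s, v
```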