From YouTube: 8. Demo: Accelerating a Real Workflow
Description
From the NERSC NVIDIA RAPIDS Workshop on April 14, 2020. Please see https://www.nersc.gov/users/training/events/rapids-hackathon/ for all course materials.
A
So in this segment we're going to talk about a couple of things: we're going to actually accelerate a real workflow, and I believe we have Taylor Groves in attendance. We're going to walk through the workflow that he's generously allowed us to use as a guinea pig for how we can take a CPU workflow, evaluate it, understand it, convert it to run on the GPU with RAPIDS, and then get some serious speedups. So thank you, Taylor.
B
Yeah, yes. For this workflow: at NERSC we collect a lot of counters on our systems, running every second. There are probably one or two thousand counters per switch, which we have on the system collecting data every second. That gives us information about how the network performed, and my background is looking at our high-speed network performance and trying to improve it; so loading in all this data on the CPU takes a lot of time.
A
Yeah, thanks so much. Okay, so with that, we're going to go through this workflow and then actually do the port live, because everyone always says live coding is always a good idea. So we're going to do that, and then hopefully at some brief moments during this we're going to take some stock and say: this is why this makes sense, this is the way we're thinking about this, and here's how to think about structuring workflows for the GPU, because often it's a little bit different than thinking about the CPU.
A

You can see that it's got some standard imports: pandas, multiprocessing, timing, NumPy, et cetera. He's also written some custom modules and four functions that he's using to do that analytical work, the processing of these counters to make them human-informative that he was talking about, and that's something we're going to have to look at as well. So let's just actually go through this and take a look at what's happening. Taylor has also created a timer object to help us understand how long things take, which is very helpful, and this workflow is being distributed across cores in the system.
A

This is called combine_query_dataframes, and we want to parallelize it; we want to parallelize it with the multiprocessing Pool API, which is the standard, canonical API for spreading Python work across multiple processes, each one using a different core. You map your functions with the pool's mapper, and then you have to make sure that you join and close at the end, before returning your data, to synchronize. And this parallelize function essentially just takes an arbitrary function, it looks like, and it's going to take this function.
A

We're going to create a pool of that many processes and map each chunk to one process, then map the function to that data, run it in separate processes, and bring it back together; makes perfect sense. And then this function run_on_subset looks like a way to actually run the function, in particular running a function in a way that looks like it's going to be a row-wise operation on a pandas DataFrame. This axis=1 is a giveaway, for those familiar with the pandas API, that this is probably going to be a pandas data set, this data_subset.
We don't know that yet, because we haven't seen it, but it's kind of a giveaway. And then there's this wrapper function, which is a wrapper around parallelize to allow passing arguments to it; it's using the functools.partial API to pass different arguments and things like that. So that's it. So let's actually take a look at this workflow. There's a sample of data here; this is not terabytes of data, which would take too long for us to go through. This is a sample of data.
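The run_on_subset plus functools.partial pattern described above can be sketched like this; the names follow the talk, but the implementation details are assumptions:

```python
from functools import partial

import pandas as pd

def run_on_subset(func, data_subset):
    """Apply `func` to every row of a chunk; axis=1 is the row-wise
    giveaway discussed above."""
    return data_subset.apply(func, axis=1)

# functools.partial freezes the first argument, leaving a one-argument
# callable that a pool mapper (or plain map) can call on each chunk.
row_sum = partial(run_on_subset, lambda row: row["a"] + row["b"])

df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
result = row_sum(df)
```

This is how extra arguments reach the worker function even though `Pool.map` only passes one argument per item.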
I think it's a few gigabytes; I forget how much. But you can see that I haven't cleared my cell: it's going to be three and a half million rows, and we're going to read this into memory, so it's going to take a little time. So, while this is going, I'm going to explain what's going to come next in this workflow. We looked through this, and we saw that the first thing Taylor does is sort the data by time.
A

Okay, makes sense: sorting by time. Then we can take a look at the data. This data looks like this: it's got a time column, and it's got this column that, anecdotally, I know is going to be about these different systems, getting counters and using these things; but I'm not well versed in this. We'll work with Taylor to get a better understanding, and then we get an understanding of what's actually going on. So in this case the key aspect here is that there are 800 columns and three and a half million rows.
A

So this is a large amount of data to process. We're going to do this kind of counting for every row, and we have millions of rows, we have hundreds of columns, and the counting logic is actually fairly complex. But we also see that there's an identifier for which router this came from; it just looks like a hash or something. So Taylor has provided some examples of doing this.
A

So with one process we're going to test this with different numbers of rows: for i in range(1, 3), so i is 1 or 2, we're going to do either 10 rows or 100 rows, because it's 10 to the i; then take a sample and try this to see how it scales, and the rest of this code is about estimating that. So we'll run this, and we'll see that one process could do this aggregate_vcs function, the one being parallelized, on 10 rows in about 1.6 seconds.
A

So naively, if we have one process doing it in 1.6 seconds, we're already really concerned, because we have millions of rows. And we see that we have linear scaling, because when we scaled up to 100 rows, it was roughly 10 times slower than 1.6 seconds. Let's call the first one one and a half seconds and this one 15; we're scaling linearly, in rows and in processes.
A

If we have four processes, 10 rows took maybe half a second, and 100 rows four seconds. The estimated time on this machine to process all the rows at this rate is two and a half hours. In this case, that is too long, because this is only a sample of the data; we can't wait two and a half hours. We have to speed this up; we know that's the goal. So let's see what happens next. This is the second part of the workflow. We had this aggregate_vcs function, which we don't really understand.
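The back-of-the-envelope estimate described here can be written down directly. This is a sketch; the helper name and the linear-scaling assumption are mine, not code from the notebook:

```python
import time

import pandas as pd

def estimate_total_seconds(func, df, n_sample, n_total):
    """Time a row-wise apply on a small head() sample and extrapolate
    linearly to the full row count, mirroring the estimate in the demo."""
    sample = df.head(n_sample)
    start = time.perf_counter()
    sample.apply(func, axis=1)
    elapsed = time.perf_counter() - start
    return elapsed * (n_total / n_sample)
```

With ~1.6 s for 10 rows on one process, this kind of extrapolation is exactly what yields the "two and a half hours for 3.5 million rows" figure.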
A

We have this other function, which we also know does something on a row-by-row basis, because we have the parallelize-on-rows wrapper, and we can see how long it's going to take. I think we have some intuition that this might be fairly time-consuming, but we can see that in general we're getting some good speedup when we go to more processes: a thousand rows only took four and a half seconds, then a thousand rows took one and a half seconds, and even one second with eight processes.
A
So
this
actually
runs
a
lot
faster,
but
it
still
might
benefit
from
speeding
up.
So
at
this
point,
we've
got
a
sense
of
the
workflow.
We
see
that
there's
a
clear
bottleneck
right
here,
but
we
don't
really
understand
it.
So
now
our
job
is
to
say:
what's
going
on
in
these
functions,
we've
gotten
a
sense
of
what's
happening.
We
know
that
the
output
we
want
is
this
temp
color
data
frame,
and
it's
created
a
bunch
of
new
information
columns
in
this
data
frame.
We
had
800
columns
before
roughly
400
43.
A
Now
we
have
almost
1100
columns,
so
we've
created
a
lot
of
new
information
at
the
end
of
this
data
frame,
which
is
the
goal
lots
of
good
stuff.
So
at
this
point,
we're
ready
to
say
well,
what's
going
on
in
these
functions,
we
noticed
that
sorry
lost
my
place.
All
of
these
functions
were
using
come
from
this
ldms
PP
module.
A
It's
the
same
one,
it's
the
same
one.
Let's
take
a
look
at
this
ldms
pp
module,
we're
importing
it.
Let's
take
a
look,
and
please
let
me
know
if
you'd
like
me,
to
make
the
font
larger
I'm
happy
to
do
that,
because
I
can
see
it's
a
little.
Maybe
let's
do
it
anyway,
so
this
is
a
module
with
a
lot
of
functions,
figuring
their
own
module,
your
insular
yes,
I
am
thank
you
thanks
Taylor.
So
this
is
a
function.
This
is
a
module.
A

Let's see what goes into this. We'll find this function, and here we go. In fact there are actually two versions of this; this one was ten percent faster, so this is the one that was part of the workflow. There's a lot of stuff going on here; it's a fairly large function. So let's try to unpack it.
A

We know it's operating on pandas DataFrames, so we know that it has some implicit structure about how it does processing, and that processing is on a row-by-row basis. So this row is the unit of account, essentially. This row is going to have attached to it all of the columns in the DataFrame for that given row; that's just the way that we apply functions on a pandas DataFrame. So we know that, and so we can see...
A

There are some loops here, actually a nested loop: we're doing something five times, and then for every one of those things we're doing another thing eight times, so we're doing something 40 times. What are we doing? Well, we're creating some variables and initializing them to zero; these are the flits.
A

We're also creating some strings, and it turns out we're going to create the strings based on where we are in these loops. Okay, so we know that the looping logic is important for some kind of strings we're creating. Then we see the same thing, but it's now for stalls rather than flits, so it looks like F and S are indicators that are prepended on these names to tell us what we're working on, and it's the same kind of information.
A

You know, there are different things being created, and these look like incoming packets versus incoming flits or something; I'm not well versed in this domain, but it seems like we've got a sense of what's happening: we're creating some labels, and we'll probably do something with them. So then we get to some more logic: for every one of these iterations, for every set of these things that we've created, we're going to loop through and grab the value from this column.
A

So this column is presumably in the DataFrame, and we're going to grab the value (it's numeric, because this DataFrame was numeric) and add it to this total, which we have up here. So we're doing a sum across several columns, in this case four columns, and we're defining which columns we're using based on this r value, this c value, and this vc value in these loops.
A
Okay.
So
this
is
a
binary
operation,
we're
doing
a
binary
operation
and
then
we're
doing
a
reduction,
we're
doing
a
sum.
Essentially
we're
just
crab.
Sorry
we're
doing
sorry
we're
doing
a
reduction
operation.
It's
this!
It's
a
something,
we're
grabbing
it
and
then
we're
doing
a
sum.
The
total,
after
the
by
after
the
binary
operation
of
the
addition-
and
so
we
know
we're
just
doing
addition
and
summation.
So
that's
pretty
good.
So
far,
we've
got
a
sense
of
what's
happening.
We
see
it's
happening
again
here.
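That per-row accumulate pattern might look like the following minimal sketch; the column-naming scheme here is invented for illustration, and the real ldms_pp code differs:

```python
import pandas as pd

def aggregate_row(row, r=0, vc=0):
    """Sketch of the pattern described above: build column names from
    the loop indices, look each one up in the row, and accumulate a
    sum -- a binary op (+) feeding a reduction (the total)."""
    total = 0
    for c in range(4):  # four columns feed each total
        total += row[f"f_r{r}_c{c}_vc{vc}"]
    return total

row = pd.Series({f"f_r0_c{c}_vc0": c + 1 for c in range(4)})
```

Calling `aggregate_row(row)` sums the four matching entries of the row.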
A

The same thing: addition, summation, binary ops, reductions. Those are great; GPUs love them. Then we're doing some more addition work. It looks like, after we've gone through these, we're going to combine the vc_req and the vc_resp to create the overall flits_vcx, and then the stalls_vcx, and we're going to do something with these, presumably. Okay, here's the answer: those labels we created up here. We created them...
A
We're
gonna,
make
a
new
column
in
the
data
frame
and
we're
gonna
put
that
sum
that
we
calculated
up
here
or
we
initialized
up
here,
and
we
add
a
two
right
here:
we're
gonna
make
a
new
column
and
put
that
sum
in
there
and
we're
gonna.
Do
it
a
bunch
of
times?
Actually
we're
gonna.
Do
it
for
that
one
for
this
one
and
we're
gonna
do
per
then
we're
gonna
do
the
same
thing
for
the
stalls.
We're
gonna
take
this
stall
some
this
one
in
this
one.
A

Here we go: get_per_router_counters_by_color. We're going to take in a row, just like before, and we're going to initialize some things, just like before: some counters, presumably. And then we're going to loop through this index of the row, and in this case the index of the row is going to be all the different identifiers associated with it. So what this is really saying is that we want to loop through and evaluate... oh, I don't have it handy here, but it'll be on this one.
A

We want to evaluate this router ID here, this c7 vc1 and so on; that's what this index is. Then, actually, sorry: that's what the name is, I think, and the index is going to be the actual columns themselves. So we're going to evaluate all of the columns (that's the index; I'm getting my pandas logic mixed up), we're going to evaluate all the columns and loop through them, and then for every column we're going to run this get_tile_number function on it. So that's going to give us something.
A

This is a fairly clear pattern. We're going to capture a pattern of a C followed by some number of digits, then a dash, then some arbitrary number of digits, then (excuse me) a C and more digits, and an S, and so on. And then we're going to see if we can match this pattern to that name, and the name is this: this is the thing we're trying to match against, right here, these things. And it makes sense.
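As a rough illustration of the kind of match being described (the real expression in ldms_pp has eight capture groups and a different shape), a regex like this captures a trailing "slot" number from a counter name:

```python
import re

# Hypothetical pattern in the spirit of get_tile_number: a letter and
# digits, a dash, more digits, then c<digits>s<digits>; the final
# captured group is treated as the slot.
TILE_RE = re.compile(r"r(\d+)-(\d+)c(\d+)s(\d+)")

def get_tile_number(name):
    m = TILE_RE.search(name)
    return int(m.group(4)) if m else None  # last group = slot
```

The real function captures the eighth group of its pattern; the idea is the same, just with a longer pattern.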
A
We
see
that
there's
the
letter,
the
number
the
dash
and
so
on,
so
that
checks
out
we're
gonna.
Do
this
and
we're
capture
that
last
match
the
8th
group
1
2
2
3
4
5
6
7
8.
So
we
want
that
final
one,
we're
gonna
call
that
slot.
Ok,
so
you
can
see
our
already.
This
is
a
fairly
you
know:
complex
amount
of
stuff
happening
it
taking
a
little
time
to
understand.
A
So
now
we
get
to
some
branching
logic:
we've
got
our
slot
and
if
these
teams
that
we
created-
which
we
don't
understand
yet-
but
if
this
tile
number
is
within
certain
conditions,
there's
some
branching
logic.
We're
gonna
do
things.
If
this
is
the
case
and
if
this
column
has
flit
vcx
in
it,
you
know
this
one
doesn't
have
flit
vcx.
Neither
does
this
one.
Neither
does
this
one,
but
some
of
them
do.
Presumably
if
it
has
flit
vcx
in
it,
we're
gonna
do
some
counting,
we're
gonna.
C
A

We can see the actual numbers, but they're not super important from the GPU perspective; we can just keep the same branching logic. But we're going to have a new condition right now: this is again a condition that looks like it's got another sub-condition, so it's a little more complicated. It's using that slot that we just had, so we're going to say: if slot is less than eight and flits are in the counter (if flit_vcx is in the column name), add it to green. Okay, so we're getting a sense.
A

The color here is important. Same thing with stalls. And then we're going to add it to black if slot is greater than or equal to eight, because that's the else statement. So we're getting a sense of what's happening. This logic then continues, with this further branching: if this is not true, then we go to the else, and then we do the same logic, but for different things. At this point, then, we're going to return.
A

It looks like some new columns that we're calling router_flits_black, router_flits_blue, flits_green, etc., and these are just the sums that we've calculated. So at this point I think we have a pretty good handle on this workflow: we're creating new columns that sum up the counters from all these different sensors or routers that are coming through the system, and spitting out the results. This is actually just like Taylor explained; it's great when it works out like that. And so the next step we would have is: well...
A

We know it's slow, but why is it slow? It's almost always the right approach to start by profiling the workflow that exists already, before doing anything else. So I'm going to do that. I'm going to use a tool (and I'm zooming out just for a moment so I can do this cleanly); I'm going to create a Python script to help me profile this, and I'm going to use a tool called SnakeViz, which I highly recommend for those who are not familiar with it.
A

It's a tool for visual profiling, and it does work in Jupyter notebooks, but it will be cleaner to do it in a Python script, so I'm going to do it here. I'm actually not going to use any of the parallelism, because it doesn't change what we learn from the profile; it would just capture the overhead, and it would make profiling very hard because it's multi-process. So I'm not going to run the multiprocessing; I'm going to use this to do the same sort (obviously I want to make sure we're on the same workflow). Instead, what I'm just going to do is run the function directly; that's fine with me. So I'm going to take these right here and just run them, not through the wrapper: I'm literally just going to apply this to my DataFrame and say temp.apply(...).
D

Hey Nick, sorry to interrupt; we lost your sound. Can you hear me?
A

Sorry, my headphones are acting up; apologies for that. I'm not sure when I went out, but what I'm going to do here is profile the workflow using something called cProfile, and then visualize it in SnakeViz, which is a library for visual profiling. The cProfile bit is baked into Python. I'm going to save the result as this file, I'm going to run this, and... I have to activate my environment.
A

Now I'm going to run this, now that I have pandas available; this is, I guess, a good lesson in making sure you're in the conda environment you think you're in. So this is going to run, and it's going to take a little bit of time, not too much, but we know that there's a lot of data being read by the pandas data reader right now, so maybe it'll take 30 or 45 seconds. While this is running, I'm going to show you what SnakeViz is.
A

So this is what I'm actually going to do now; I don't need this dashboard anymore. This is going to be done in a moment. What I'm going to do once this is finished is basically run this exactly, and so I'm going to take the results of this profile. Okay, so it's still going; it's taking a bit of time to sort. Then it's just going to run this apply, then this apply, and it should be done shortly, hopefully. I guess this speaks to the point that we need to...
D

[Inaudible question.]

A

A great question. So it is informative, though not necessarily for telling you exactly where to go on the GPU, or where you're going to spend the same amount of time, because to your point you might not. But it's very informative for thinking about how to attack the problem. It's possible this profile will tell us that certain things are not necessary for a first pass to put on the GPU; for example, right now we are time-boxed.
A

Awesome. This is probably my fault: I should have made this more like ten or a hundred rows, because this is why it's taking so long. We saw that this thing scaled linearly (and we saw that the other one doesn't), so this is going to take a thousand seconds; I didn't think this through, sorry. I'm going to kick this off again using ten records, so we'll have to wait a little bit longer. But while this is happening, I'll go back and get set up to do a RAPIDS version.
A

While this is being profiled: in the RAPIDS version we're going to want most of the same things, because we're probably going to need these functions, but I'm not going to grab them from the same libraries. Instead I'm going to copy them from this script and bring them up. We know we need this get_per_router_counters_by_color, so I'm just going to put it here, where it's easier for us to see. We know we need that, and we know we need this other function, get_tile_number, and there it is.
A

We'll call these the original functions. We know we need these, and we also know we need that aggregate_vcs function, which is right here. So these are our functions; these are the things we're operating with. Okay, there: this finished, great. So you can see now that I have a profile right here, an initial_workflow.prof. This is a cProfile output, and it's fairly scary to look at if you're not experienced with it, but it actually has a very specific structure.
A

It's very straightforward once you understand it: every line is essentially part of the call stack of everything that's happening, and it's just measuring time in the call stack. So what I'm going to do now is use SnakeViz to visualize this profile, this initial_workflow.prof. Sorry, I'm going to...
A

...not do that; I'm going to pass in some configurations that determine what ports to use. This is basically saying localhost on port 8080, and it's going to give me a web server. So I'm going to go to this web server, and it's coming up right here for me. This is our profile.
A

So SnakeViz is going to make that icicle plot for us. It ran for about 50 seconds, and most of the time was reading data. That makes sense: this was a very large data set, and we didn't do that much compute. But this is also a fixed cost, so I'm not that concerned about it right now; I want to see where we did things. Okay.
A

That's consistent with what we expected from the pandas version, where we saw this function scale very, very slowly. But now we can see clearly that it's scaling slowly because of all these __setitem__ calls, and in particular we're using these __setitem__ calls to create new columns. So creating columns is what's taking a really long time here, and that gives us insight into this workflow in general. But really the key thing here is: for the next 30 minutes, let's not focus on get_per_router_counters.
A

We know that eventually we can handle that one, because we've actually done it before; we've done this in the past. But let's focus now on aggregate_vcs, because 95 or 98 percent of the time in this tiny example is spent in this function, and the fraction would be even larger in the real example, probably 99.5 percent if we did this at scale.
A

So now we know; that's great. We can kill this profile; we're good to go, we're ready to start. So in this case we're still going to import this data set with pandas, because it's a fixed cost, and the pandas HDF5 reader is what we're going to use, because cuDF doesn't yet support reading HDF5 directly into GPU memory. So we're going to read the data in just like before, and we can get a little more set up while that's happening. Again, we could sort this on the GPU; we saw that sorting is very fast on the GPU.
A

But again, it's not super important, because these are fixed costs, and so we can just run them. I'm going to switch this to not be an in-place operation, but still run the same sort. And when we're developing this workflow, we don't want to work on this total data set; this is a large data set, and we saw it was three and a half million rows. We want to work on a sample, and so I'm going to call this query_sample, and I'm going to just take the first maybe ten rows.
A

I'm going to make a copy; that way we don't end up actually mutating the original data set. It's very often the case that we accidentally mutate datasets, and that's what causes that SettingWithCopy warning that I mentioned earlier in the day. So I'm going to take a sample of this, and I'm also going to have to import cuDF, because I want to do this on the GPU, and maybe I'll call it query_sample_gpu = cudf.from_pandas(...).
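The sample-and-copy pattern looks roughly like this; the cuDF line is shown as a comment, since it assumes a GPU machine with RAPIDS installed:

```python
import pandas as pd

df = pd.DataFrame({"x": range(1_000)})

# Develop against a small copy so the full data set is never mutated;
# writing through a slice of the original is what triggers pandas'
# SettingWithCopyWarning.
query_sample = df.head(10).copy()
query_sample["y"] = query_sample["x"] * 2  # safe: writes to the copy

# On a GPU machine the same frame would move over with:
#   import cudf
#   query_sample_gpu = cudf.from_pandas(query_sample)
```

The original `df` is untouched, which is the whole point of the `.copy()`.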
A

There we go. So this is, again, going to be quick... oh, right, okay: we don't support this timedelta timestamp, which is the way that pandas read in this data. We do support datetimes; we don't support this timedelta. So we can find that column, this timedelta. We weren't using this duration column in the workflow, so we can temporarily drop it. We could, of course, just cast it to a different type; we could do different things with this.
A

We could change the structure to be, instead of a timedelta, a start and an end. But for now I'm just going to get rid of it and say .drop('duration', axis=1). Now, this is the same way you'd drop a column in pandas, so it should look fairly familiar, and now we can put this on the GPU, and I'll call this query_sample_gpu.
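A small sketch of the dtype issue and the fix; the column names are stand-ins for the real data:

```python
import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime(["2020-04-14", "2020-04-15"]),
    "duration": pd.to_timedelta(["1s", "2s"]),  # timedelta64 column
    "counter": [10, 20],
})

# At the time of this talk cuDF handled datetimes but not timedeltas,
# and `duration` isn't used downstream, so drop it before transferring.
gpu_ready = df.drop("duration", axis=1)
```

`drop(..., axis=1)` returns a new frame, so the original keeps its `duration` column if you need it later.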
A

These are independent binary ops and reductions that could be done in parallel. So I suspect there's a way we can take this logic and, instead of having it work on a row basis, have it work on a column basis, and I suspect that would actually help both on the CPU and on the GPU. That's the suspicion from looking at this function; that's sort of the first way in.
A

So let's define a function just like aggregate_vcs, but let's call it columnar_aggregate_vcs, and we don't want it to work on the row; we want it to work on the entire DataFrame. That's what we want, so we kind of have a sense of what we want to do: let's operate on columns, not rows. That's the loose sense that we have. We want to sum, across each row, over specific columns; actually, that's what we want to do, right?
A

We want to put into this new column a bunch of different information, first coming from these four columns, then coming from these four columns, as defined by these loops. So we want to operate on columns; we want to sum across specific columns in each row, and we can probably do this in a couple of stages.
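The columnar rewrite boils down to one vectorized call; the column names here are invented stand-ins for the flits_vc_req family:

```python
import pandas as pd

df = pd.DataFrame({
    "f_vc0": [1, 2], "f_vc1": [10, 20],
    "f_vc2": [100, 200], "f_vc3": [1000, 2000],
})

# Instead of looping per row, select the columns of interest and do a
# single row-wise reduction across them: one vectorized call that
# pandas executes in bulk, and that cuDF runs unchanged on the GPU.
cols_of_interest = ["f_vc0", "f_vc1", "f_vc2", "f_vc3"]
df["flits_vc_req"] = df[cols_of_interest].sum(axis=1)
```

This replaces the per-row Python loop-and-add with one `sum(axis=1)` over the whole frame.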
A

A single row-wise binary op and reduction: addition plus a sum. In theory it seems like this will avoid a lot of the pain that's coming from all these repeated calls, so let's try to do that. So, with that as the plan: we still need to generate the same output columns. We need to do that, so we probably can't avoid this loop.
A

The first thing about this: it's common, for myself and probably for others, to say loops are bad, loops are time-consuming, can we avoid loops? And of course we want to avoid loops where we can, but sometimes we just need loops. So when I'm looking at these workflows and thinking about how to profile and how to port them: we don't necessarily have to move the waterfall all at once; we can move the waterfall inch by inch. And so we can start by saying we probably need these same output columns.
A

So we might still need these loops; let's keep them. And maybe we need this try: we can still wrap these in a try/except in case there's an error. Perhaps that's not necessary, but it might be. So we probably need to do a lot of the same stuff. We can start with the flits, so let's start with the flits, and we might need all the same information; let's find out. Well, we know we want to create these output columns.
A

So we probably still need these; that's what we're creating, so these make sense. But do we need these counters? Do we need to initialize a counter to zero if we're going to do a single row-wise binary op? For this binary op we can use the pandas API, or the cuDF API, to take df[columns_of_interest]...
A

...and do that sum row-wise. That's what we can do, so we probably don't need to initialize these; we probably instead want to think about it a little differently. We definitely need these columns, I feel like, but we want to get the columns we need instead of initializing one at a time. Let's get the original columns we need, for flits and stalls. These kind of separate out; again, you have to do them separately for all these different things, right? Like, we need this req label to be this vc_req thing.
A

So we know that we probably still want this loop, but we want to do it differently: we need the columns of interest to create these columns. So maybe what we can do is get all the columns at once and add them to this list, and then we'll do the same thing for this one, the stalls_vc_req; we'll do the same thing there. And also, please stop me if there are any questions; or, Lori, if you think you can save them, that's fine too.
A

Append this column... and I guess I probably want to actually append them rather than just assign. So this is going to do the same thing, except it's not going to do the computation; it's just going to collect all the columns we care about. So, okay, that makes sense. We've done something good there, I think. We probably need to do this again, though, because we had a second loop.
A

So we have this other loop over here, and we probably have to do it again, because we need all the information to be the same. So we probably should create some more columns: we should probably create the flits_vc_resp columns, and then do the same thing for stalls_vc_resp, and we probably again will want to do these as appends instead. So we'll probably do .append and put this in.
A

And then do this for the stalls_vc_resp. So all I'm doing (not that, sorry), all I'm doing is just adding these to a list to keep track of them. So at this point we actually haven't done any computation; again, we've collected the things that, for a given iteration of this loop, we want to sum across. So that sounds good; that's actually really good. And so now...
A

Oops, sorry; excuse the language. Anyway: we're starting with aggregate_vcs, our original function, which I accidentally just deleted. And we can just use append, which is in-place, so that was even simpler. Now we can probably do the sums, but we don't want to do them inside these loops; we want to do them only where we need them, which is at the same level as this try. So let's do the sums, and maybe we'll put in the except and just pass, or whatever.
A

This is original information that we kind of have to capture, and we want to add it to these rows. F is flits, so flits_vc_req is going to be this req label. So we can probably say df[req_label], because we've defined the req label up here; we still have this. So we can do all the rows at once and say that df[req_label] is probably just going to be...
A

...the sum of all the columns for flits_vc_req, row-wise. That looks pretty good, so we can probably do the same thing for the flits resp label, except of course we'll have to use the correct columns, the flits_vc_resp ones. So that looks pretty good to me. We also have this new label, though: this new label is the combination of flits_vc_req and flits_vc_resp.
A

Now, presumably I could actually just compute this by summing the thing I've created with this other thing, because I've done that work already. In this case I'm going to make it a little more explicit (we can probably optimize this later) and say this is just the sum of these, because flits_vcx is the sum of both of these, and that's what this is going to be. So this is the sum of both of these.
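Making the combined column explicit, as described, is a single elementwise addition (the names here are assumed stand-ins for the req/resp sums):

```python
import pandas as pd

df = pd.DataFrame({
    "flits_vc_req":  [3, 5],
    "flits_vc_resp": [7, 9],
})

# The combined column is just the elementwise sum of the two sums we
# already computed -- one vectorized binary op over whole columns.
df["flits_vcx"] = df["flits_vc_req"] + df["flits_vc_resp"]
```

Reusing the already-computed req/resp columns like this is the optimization mentioned above; summing the raw columns again gives the same numbers at more cost.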
A

Now again, we could optimize this by using these already-computed sums, but for the sake of it we'll just keep it for now. And so now we have to do the stalls. We've got these, so we can probably do the same thing, because it's consistent logic. So we know we need this stalls req label, which we've created up here, just like before.
A

We know that, again, we're going to have to go to the stalls_vc_resp for the second one, and then the combined one is probably going to be the same as before as well, except we're going to use the stalls versions: we're going to put in the stalls_vc_req and the stalls_vc_resp. So this is looking pretty good.
A
This is the same computation, but we've done it in a way that is not operating row-wise; we're operating column by column, so we're using the entire data frame. This takes advantage of pandas', and really NumPy's, built-in vectorization. And the reason I started like this is, you might notice that nothing about this looks like it's on the GPU. The beauty of RAPIDS is that this isn't code that is specific to the GPU; I'm writing generic PyData code. And so hopefully, at the end of this,
A
at the end of this double for loop, which I am now at the same level of, I'm gonna return my data frame, and so I hope that when I run this I'm gonna get the same results as when I do this. So let's take a look and see what happens. I'll just make an even smaller sample; maybe I'll take two rows. This is our original function, and we can see that we added a bunch of things. Let's take our new columnar version. It's very likely
A
we made a mistake, and it's very possible we made a mistake, because when you port things interactively and iteratively, you often make mistakes. This function just takes in a data frame; that's all it takes in, and the docstring sort of explains the logic that we're trying to do. We'll take in the data frame, but in this case we'll just take the first two rows to be consistent, and there's our df.
A
So we can see that, okay, it looks like we might have done something incorrectly: we've got the same rows, but we've got different values. But wait a minute, these are different columns. The ordering of columns might be slightly different depending on how we did things; it's possible that we have changed the ordering of columns unintentionally. So let's actually make sure we're taking a look at the same columns. We'll save this as res_columnar, and we'll save this as res_original.
A
So this is the process of debugging. We'll go back to the original code and take a look, and we'll see. Okay; sorry, I have the original code open in a new tab and I'll look at it here. Maybe there is something going on that we missed. Is it possible that we don't have these strings formatted correctly, and that's why we're missing a column? We wouldn't expect to be missing a column.
A
So perhaps we have a logic error in the column operations. Maybe we are doing a sum that is not actually creating the column. Or maybe we're getting an error. So let's see if we got an error; perhaps some of these actually errored out. Looks like a bunch of them errored out. So let's actually see why these errored out. This is really important.
A
It's just part of the process. When you're working with someone else's code and trying to port it, in a live session or any session, it's not always clear what's happening. So we got some errors. Let's catch the error: let's just catch the generic exception as e, and let's actually print e and see what's going on.
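As a rough sketch of that catch-and-print pattern (the label-to-columns mapping here is made up, not the real counter labels), wrapping each per-label sum lets the loop keep going while telling us exactly which lookups failed:

```python
import pandas as pd

df = pd.DataFrame({"a_req": [1, 2], "a_resp": [3, 4]})

# Hypothetical label -> column-list mapping; the "b_*" columns
# deliberately don't exist, to mimic the failing lookups in the demo.
label_groups = {"a_total": ["a_req", "a_resp"],
                "b_total": ["b_req", "b_resp"]}

for label, cols in label_groups.items():
    try:
        df[label] = df[cols].sum(axis=1)
    except Exception as e:
        # Print the label and the exception instead of passing silently,
        # so we can see which column lookups went wrong.
        print(label, "->", e)
```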
A
We seemingly aren't able to index into this for really any of these rows; if we print out the exception, it looks like for all of these rows we're not able to index in. So why is that? Let's sort of try to understand that. So we've got these columns, and we know that this is going to be part of; sorry, not that. We've got these columns, and we know it's going to be something that comes from our data frame, and so let's see what this would be.
A
Okay, so this seems like it should have this, right? It seems like we should have this, so something is off. So what are the ones we don't have? We don't have this set. So why do we think this is the case? I wonder if perhaps it's because we're building the lists incorrectly. That seems like a very likely candidate; list handling is something that's easy to screw up. And there we go. So what have we just learned?
A
What I just did was look, just to make sure we had two columns. I saw that we did, so that eliminated all of the work that wasn't operating on combinations of columns. And then I double-checked that by getting rid of the lists. So in this case, presumably, instead of using a comma here, which gives us doubly nested lists, I just need to actually combine the lists directly. These are our lists.
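In other words (with hypothetical column names), the bug was that a comma builds a list of lists, while + concatenates them into the flat list of column names that DataFrame indexing expects:

```python
req_cols = ["flits_vc0_req", "flits_vc1_req"]      # hypothetical names
resp_cols = ["flits_vc0_resp", "flits_vc1_resp"]

# Buggy version from the demo: the comma nests the two lists, and
# indexing a DataFrame with a nested list raises a KeyError.
nested = [req_cols, resp_cols]

# Fix: concatenate with +, giving one flat list of column names.
combined = req_cols + resp_cols
```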
A
I can probably just combine the lists like this, and I suspect now we will not get any errors, but, you know, fool me twice. And there we go: success. So obviously, in the real workflow, we would not just use two rows to verify that this is correct, but it's nice to see that it looks correct, and the logic made sense, so it should be correct. We would, of course, verify this properly. And so at this point we'd say: okay, well, why did we do this?
A
We didn't just do this to improve the CPU code, which is nice; I mean, hopefully this has improved the CPU code. So let's actually take a look at the CPU code's speed, and then we'll actually run this on the GPU. So we have this query sample, and, let's say, we know it works now. So let's just take this again; let's take this farther down, where we have some space. What we're gonna do is maybe look at 100 rows, and run this, the original one, with a hundred rows.
A
And as expected, it's much faster. That's great, because we are also going to run this on the GPU in a second. So with 100 rows it took this long. I'm not gonna run the original anymore, because it's scaling linearly, so with a thousand rows we'd be here forever; but with a thousand rows, this new one should be pretty quick, less than a second. Plus, on the GPU: now, this code will run on the GPU. All of these APIs exist on the GPU; we can run them. So let's do it.
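A minimal sketch of that "same code, different library" point: if cuDF (the RAPIDS GPU DataFrame) is installed, the very same columnar lines run on the GPU; otherwise pandas runs them on the CPU. The column names here are hypothetical.

```python
try:
    import cudf as xdf      # GPU-backed DataFrame, if RAPIDS is available
except ImportError:
    import pandas as xdf    # CPU fallback; the code below is unchanged

df = xdf.DataFrame({"flits_vc0_req": [1, 2],
                    "flits_vc0_resp": [10, 20]})
# The exact same columnar sum as the pandas version.
df["flits_x"] = df[["flits_vc0_req", "flits_vc0_resp"]].sum(axis=1)
```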
A
So the first time we run this, it can take a little bit longer, because it's gonna get compiled. So we'll run it again to avoid the JIT compilation. So it's 1.8 seconds; it's actually a little slow. With a thousand rows, the pandas version was faster, and so why is that? Well, it's the way we ported this: we are doing 40 iterations, and within each iteration we're making a call to sum six times.
A
So we're doing 240 separate kernel calls, no matter how many rows we have. If we do one row, we're making 240 separate kernel calls; if we do 1 million rows, it's the same thing. So there's an overhead to those kernel calls. So let's see what that means. Let's go to 10,000 rows. The pandas version is using vectorization, so it's gonna be much faster than the original.
A
In fact, it will hopefully do ten thousand rows in about ten seconds or so. The pandas one took less than a second for a thousand, and now it took about ten seconds. So it looks like when we scaled up from a thousand rows to 10,000, scaling up by a factor of ten, our time scaled roughly linearly, by a factor of ten. What about on the GPU? We scaled significantly, significantly better: we can keep this time way down, roughly, in this case, 1.8 seconds again.
A
What about a hundred thousand? I'm not gonna run this on the CPU, because I can tell you that it scales linearly; we know that, and it's gonna take 120 seconds, so I'm not gonna waste the time. But with a hundred thousand rows the GPU version will still take two to three seconds, and with two hundred thousand rows the GPU version will still take very little time. And so the amount of time it's actually gonna take depends on how you ported it.
A
We know from above that this is not the optimal port, and we could probably optimize it. We've already done some of these calculations; we don't need a kernel call here. We can actually do a binary operation between these two sums that we've created. We could also potentially even unroll some of these loops, if we're clever about it. But in general, we just went from a thousand rows in 12 seconds with pandas, with the improved CPU version, to 200,000 rows in four seconds with the GPU.
A
Now, obviously, that's great: this whole workflow that we were estimating was gonna take two and a half hours, we can actually do in like 30 seconds, which is just awesome. But it's also great because this code scales. I'm not gonna pull it up on a big cluster, but I'm just gonna show you, to demonstrate, because we're almost out of time, that we can put this in a Dask data frame. So I'm using the Dask data frame API with the data frame that we just used.
A
It's gonna ask me to set a number of partitions. This is a fairly sizable data set; it's got 200,000 rows, so it's not too big. I'll put this in ten partitions; it's not super important, we just need to do it. So this is a Dask data frame. We can run this same code on that Dask data frame, because none of these APIs are anything unique; this is Dask compatible. So, you know, Taylor mentioned earlier that his actual workload is in the terabytes of data.
A
This will run on the data frame. Now, the data frame has ten partitions, and because we're not using any parallelism, it's going partition by partition. So in this case, using Dask like this is actually gonna be slower. But if we had a lot of GPUs, we could use this and split up this work incredibly efficiently. If I just used one partition, it would be as if we were using the cuDF data frame rather than a Dask
A
cuDF data frame, and it would be very quick. But I wanted to sort of show that this same code will run on the GPU, with both cuDF and with Dask. This is taking too long and I'm being impatient, so I'm gonna actually recreate this as a single-partition data frame and run this again, because I'm impatient. But in general, Dask adds some overhead, because it has to do orchestration, and it's gonna do these sums a little bit differently; but it's gonna run, and it's actually
A
It seems to be adding a decent bit of overhead in this case; it's running, and it's taking a little bit more time because of the overhead. But we could also run this with the map_partitions API and just pass this function. That's a way to pass functions down to the underlying objects in memory, and that was just as fast. And so the result of this is a data frame, but now it's a data frame that has our results in it, if we call persist.
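Conceptually, Dask's map_partitions hands each partition's underlying pandas or cuDF object to the function. Here is a rough pandas-only sketch of that split-apply-concat idea, with made-up column names and without Dask itself:

```python
import pandas as pd

def add_totals(pdf):
    # The ported columnar function: it sees one whole partition at a time.
    pdf = pdf.copy()
    pdf["total"] = pdf[["req", "resp"]].sum(axis=1)
    return pdf

df = pd.DataFrame({"req": range(6), "resp": range(6)})

# Mimic three partitions: slice, apply the function to each, re-concatenate.
partitions = [df.iloc[i:i + 2] for i in range(0, len(df), 2)]
result = pd.concat(add_totals(p) for p in partitions)
```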
A
Eventually it will finish; well, there's the risk of live coding. But in general this pattern will work, and you can see that Dask is adding some overhead, but it's allowing it to succeed. And so, yeah, that's how we would take this workflow. We didn't port the second portion yet, because we ran out of time, but we ported the most important part, the one that took up 95 to 99 percent of the time, and we made it go from two and a half hours, or, you know, hours, to seconds.
A
So I think, in general, we always want to operate on columns, not rows. If we can distill the logic down, I would say something like: try to rethink your operations to operate on columns rather than rows. Doing that lets you use existing APIs, and that is often gonna help your CPU code too. We saw down here that by restructuring this, we can improve the pandas version quite a bit.
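That takeaway in miniature, on toy data rather than the real counters: the row-wise loop and the columnar operation compute the same thing, but only the second one uses the vectorized fast path.

```python
import pandas as pd

df = pd.DataFrame({"req": [1, 2, 3], "resp": [10, 20, 30]})

# Row-wise, like the original workflow: one Python-level pass per row.
totals = []
for _, row in df.iterrows():
    totals.append(row["req"] + row["resp"])
df["total_rowwise"] = totals

# Columnar, like the port: a single vectorized operation over whole columns.
df["total_columnar"] = df[["req", "resp"]].sum(axis=1)
```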
A
It doesn't completely solve the problem, because it's still gonna take too long; it took ten seconds for ten thousand rows, versus four seconds for hundreds of thousands on the GPU, and that's not gonna cut it. But it still made an improvement, and with GPUs that improvement is a thousandfold. So that's why it's so important; that's number one. Number two: don't try to over-optimize before you get some wins. You know, you don't move a waterfall all at once.
A
It's inch by inch: every year the waterfall moves backward like five inches. That's what happens. You can still get big speedups by making the quick changes that make sense. We did this all in the span of, realistically, about 30 to 40 minutes. We could, of course, optimize this further, and in fact, if I had actually done that first, I probably wouldn't have made the mistake of trying to combine two lists incorrectly.
A
But little syntax errors happen. Don't over-optimize. Maybe eventually we could say: well, do we even need these? Do we have to do all these separate sums? Can't we do this in some other way that combines these kernel calls and fuses them? Maybe yes, maybe no. We don't know that until we try, but we don't need to try unless it's necessary: this port already makes the entire workflow, on like three, three and a half million rows, run in less than a minute-ish.
A
So maybe that's fine. Maybe we don't need to get faster than that; going from two and a half hours to a minute might just be enough. And so don't try to over-optimize. Number three would be: actually measure correctness. We didn't do that here; we kind of just eyeballed it, because we're doing a demo. We were just eyeballing it, but it's important to do more than just say: oh yeah, these are the same, these are the same.
A
These are the same. Actually measure correctness. We actually put out a blog semi-recently on the RAPIDS Medium page about measuring correctness in workflows. I'll pull it up quickly if I can find it, but if not, no issue. Yeah, so here's an example for doing this, for verifying correctness. There's a bunch of things you're gonna want to check: things like, are the types the same? Is the index actually the same? Are the columns in different orders? Are you perhaps using different precision?
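One sketch of what "actually measure correctness" can look like in pandas: pandas.testing.assert_frame_equal checks values, dtypes, and index in one call, and aligning the column order first catches the reordering issue from earlier. The tiny frames here are made-up stand-ins for res_original and res_columnar.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

res_original = pd.DataFrame({"req": [1, 2], "resp": [3, 4]})
res_columnar = pd.DataFrame({"resp": [3, 4], "req": [1, 2]})  # reordered cols

# Align the column order first, then compare values, dtypes, and index
# in one call instead of eyeballing printed output.
aligned = res_columnar[res_original.columns]
assert_frame_equal(aligned, res_original)
```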
A
Now, it's important to think about that, and so we put together a checklist; I'll share this in the chat as well. But make sure that you really do measure correctness, because if you're trying to actually solve a research problem, or an infrastructure problem in this case, there's nothing worse than thinking you solved it faster when you actually haven't. And then one thing I would say is: come back with a fresh mind. That's probably the fourth thing. Right now, we've done a good job.
A
Hopefully. We made a couple of mistakes, and we found them pretty quickly. Eventually we might find a new way to come back to this, but we don't want to belabor the point. If we can get a win, we got a win, and that's great. This workflow is now fully on the GPU. Obviously, we didn't port the second function here, but we did port the second function with Taylor originally, and again, the way we ported the second function looks quite similar to this.
C
We're also going to email that to everybody, but I do realize that, for a lot of people on the call, it's getting late. It's getting late for Nick; it's getting late for Zahra, if she's still on. But if you do plan to drop off here at the break, just watch for that email. We would really encourage people to stick around for the last part, especially because there are probably a lot of questions and answers. But yeah, during the break, be thinking about those, and I'll reconvene us maybe at 3:25.
A
And just a note, while another break has just started: I just ran this on a million rows, just for the sake of it. We did a million rows. I'm sorry, actually, that was only a hundred thousand; excuse me, never mind, I am incorrect. But I will run this on a million rows and we'll see how long it takes. But
A
stay tuned. So in a different scenario, we would obviously not be doing all these copies; we wouldn't be doing it like this, we would just read it in. But in this case: 1 million rows, columnar, not perfect. We could have optimized it more. As we said, this is a waste of computation; it is objectively a waste. We could just do this with addition, because we've already done both of these sums. But that's okay, even without optimizing.
A
Ten thousand rows is gonna take ten seconds on the CPU; with a million rows, the GPU takes half that time. So we did it in, you know, fifty percent of the time; or, in one third of the time, we could do a hundred times as much compute. So that's, loosely, a three-hundred-fold speedup, which is awesome. That's just really exciting, and that's all I got.