From YouTube: Apache TVM Community Meeting, April 6, 2022
A
Okay, hi everyone. So this is the TVM community meeting. We are now doing this each week: a gathering of people who participate in the TVM community. The meeting relies mostly on having an agenda proposed by the community.
A
So a couple of reminders from our side. One is that we are always looking for volunteers to propose new topics, like today's topic, Collage, which will be presented and the discussion organized by Mark. The second thing is that we are also trying to have more people hosting these meetings, like I am today and like Andrew was doing last week, to guarantee the continuity of the sessions.
A
You can just post there, or you can contact us directly on the Discord, for example, if you want to either present or host one of these sessions. Now, to start on the community announcements for today: I'm very happy to welcome Gustavo Romero and also Mehrdad Hessar as new committers for the project. This is basically a recognition of your great effort in pushing PRs, reviewing code, and being active in the discussions on the forum and in all the other channels that we have for this community.
A
So yeah, congratulations to both, and I guess this is mostly what I have as an introduction. Again, we are recording this session, so it will be available after the meeting in case you want to refer someone to it and share the video as well, so that people can see the great discussions and things we host here every week. And with that I can hand over to Mark Shultz, who's going to present the main agenda topic for today, which is Collage, a new contribution that was recently merged into the codebase.
B
We're a pretty small crowd, so why don't folks just interrupt? Yeah, okay. And if you just want to chime in on audio: posting questions in chat is fine, but it tends to encourage large questions that aren't necessarily in sync with the presentation, and it makes it a little harder. So since we're a small crowd, why don't we just take... yep, okay. Okay, I'll do the screen-share thing; bear with me.
B
There you go, let's see if it works. This is a new machine, so forgive me if there are shenanigans I have to go through. How's that? Yeah, present slideshow, all right. So, as Leandro mentioned, this went in, when was it, last week. With this RFC I really want to try to be more... you know, this isn't written in stone tablets; it's not that the RFC is fixed and everything from here on is fully determined.
B
I do want to be able to go back around this RFC and revise it as we learn more. Parts of this project are quite speculative, as you'll see, and I think as we get deeper into actually checking stuff into main, I'm pretty sure we'll want to go back and revise. Ultimately I would like to make this an as-built, as-implemented record of what we did. It's already out of date, just based on my own prototyping, anyway.
B
So I've been mostly chipping away at this. Michalis Papadimitriou, who's actually in this session (thanks!), has been working on this; it's actually a good time for him, he's in Greece. Matthew Barrett is a new OctoML employee; he's just started with us. And Sung Park, who we're very lucky to have here at OctoML, is one of the authors on the preprint, which I'll mention later. But most of the problems you can blame on me.
B
So what's the basic idea here? Well, let's start with MNIST. I took out one layer, because once you've seen one layer you've seen them all. The idea is that we want to find an optimal partitioning for your overall model graph. We have all of these available backends, and one of the very unique features of TVM is its...
B
...ability to support bring-your-own-compiler plugins, and those plugins can work at very different levels of abstraction. Some of them, like TensorRT, almost want to be whole-model...
B
...compilers. Others, like the TVM backend itself, are all about good scheduling for kernels. CUTLASS is more of a library with its own tuning infrastructure, and then we have more library-based things. So we really want to make sure that we bring a sense of tuning, of optimality, to this mixing and matching of all of these BYOC plugins and TVM's own lowering machinery, in the same way that TVM itself does tuning for schedules. And so basically we want to go from that to something like this.
B
I'm just making this up, so don't take it as representative, but this particular partitioning has decided that TensorRT does a particularly good job with the fused conv2d, our old friend; the dense (the matmul-transpose) down at the bottom turns out to be most efficient on cuBLAS; and TVM is left behind to fill in the kernels that remain, in this case a pad, a max pool and a reshape, but also (I won't really get into this)...
B
...it's also responsible for all the additional glue. Remember, at the end of the day the VM or the graph executor is responsible for all the plumbing, and some of that can include, you know, actually pushing constants in and so on. So part of the partitioning decision is also to decide that yeah, these particular expressions, constants, whatever, don't need to be fused into any particular kernel; they are just executed by the VM itself. So in effect there's a kind of residual host partition in this world as well.
B
Okay. And obviously the whole point of this is that overall end-to-end model latency is reduced compared to following an eager strategy, which would be just using TVM, or just using partition-for-TensorRT, or even carefully constructing some chain of partition-for-TensorRT, then partition-for-cuBLAS, then partition-for-something-else, to try and find that optimality. Instead, we simply fall back on using measurement to guide this optimal selection strategy.
B
And that's honestly about it in terms of the setup; everything that follows from this is all engineering, so let me get into that. Obviously this is based on a preprint. Just to declare: I'm not trying to do research here. I really want this to be an engineering project, so I'm taking my job as basically picking up the research that's already been done and getting it into main in a sustainable and reusable way.
B
So I'm sure there's going to be lots of "oh, we could do this, and we could do that, and have you seen this paper?" and so on. I'm deliberately putting a little bit of a wall around us so that we don't get drawn into that, because I do want this to be very much an engineering project. Anyway, so that's the preprint.
B
I'm not going to present any graphs and so on; this isn't a conference talk or anything, and I defer all of the performance questions to the paper. Now, obviously, here at OctoML we have extensive infrastructure for doing performance comparisons and sweeps across all sorts of models.
B
A
lot
of
those
models
aren't
even
public
they're
actual
models
that
our
customers
have
given
to
us.
So
obviously,
internally,
we
are
paying
very
close
attention
to
that,
but
we're
still
in
the
process
of
building
that
infrastructure,
we're
trying
to
connect
what
we're
doing
here
into
that
existing
infrastructure,
and
so
that's
why
I
don't
want
to
start
to
throw
bar
charts
at
you
or
anything
like
that,
because
I'm
still
not
confident
myself.
B
Nevertheless,
the
paper
shows
that
indeed,
you
can
do
better
if
you
kind
of
use
actual
empirical
measurements-
and
you
are
prepared
to
be
very
flexible
in
mixing
and
matching
between
the
different
backhands.
The
paper
does
demonstrate
that
you
can
do
better
beyond
simply.
You
know
partition
for
tens,
rt
and
letting
tv
and
do
the
rest
things
like
that.
B
Just
for
those
who
happen
to
be
familiar
with
the
paper,
if
you're
not
just
ignore
what
I'm
about
to
say.
So
this
the
rfc
is
basically
the
pre-print
we've
taken
away
the
evolution,
research
aspect
from
the
paper.
I
know
ml
cis.
Folks,
love
evolution,
research,
it's
a
it's
a
great
way
to
do
all
sorts
of
fun
things,
we're
we're
just
sticking
to
basic
dynamic
programming
approach.
For
now
that
we
also
have.
B
We
were
quite
worried
in
looking
at
how
to
bring
the
paper
into
maine
in
a
sustainable
way.
We
were
quite
worried
that
the
the
papers
prototype
implementation,
relied
on
a
whole
new
library
of
if
you
like,
fusion
patterns
or
byo
c
patterns
with
its
own
fusion
engine
and
so
on,
and
we
were
quite
concerned
that
we'd
kind
of
end
up
with
you
know
a
non-scalable
process,
namely
that
every
time
one
of
our
mls
customers
came
to
us
with
a
new
model
and
a
new
target
we'd
have
to
go.
Oh
wait:
okay!
B
Well,
this
byoc
has
these
patterns,
but
now
we
have
to
replicate.
You
know
all
of
that.
All
of
that
logic
up
in
the
collage
level.
We
just
felt
that
that
wasn't
sustainable.
So
one
of
the
big
things
we've
tried
to
do
in
this
this
kind
of
version.
The
rfc
version
is
to
make
sure
that
we
directly
piggyback
on
byoc.
B
We
also
the
paper,
also
kind
of
introduced,
a
new
notion,
that's
orthogonal
to
targets
and
devices
called
back
ends.
We've
taken
that
notion
and
folded.
It
back
into
targets
which
you'll
see
later,
and
the
thing
that
we
have
added
is
we've
kind
of
lent
very
heavily
into
this
new
notion
of
a
partition
specification
which
is
kind
of
has
its
own
little.
It's
like
a
it's
a
little
bit
like
df
patterns.
B
In
fact
it's
built
on
top
of
df
patterns,
but
it's
its
own
library
of
base
and
combinator
rules
that
when
you
combine
them
in
different
ways,
you
can
express
different
partitioning
strategies
and
that
flexibility
means
that
we
can
make
the
overall
search
much
more
efficient
and
we
can
also
without
getting
drawn
into
endless
patterns
and
so
on.
We
can
tune
the
search
for
the
different
byoc
targets,
I'm
not
really
going
to
get
into
that,
except
for
one
slide.
We
can
talk
about
that.
B
If
you
want
to,
everyone
is,
of
course,
welcome
to
take
a
look
at
the
tree.
We
are
just
starting
to
peel
things
off
the
tree
and
start
to
push
through
to
maine.
Now
that
the
rc
is
kind
of
in
place,
I
expect
it'll
take
us
a
month
or
two
to
kind
of
chip
away.
All
comments
are
welcome.
B
If,
during
the
pr
review
we
realize
yeah,
we
didn't
quite
get
this
right
very
happy
to
backtrack
to
the
rfc
and
we
you
know
we
we
follow
through
and
just
if
you
did
want
to
actually
try
out
the
the
tree.
I
should
warn
you
I'm
you
know
not
keeping
it
in
it.
It
goes
into
experimental
dead
ends.
So
you
know
look
at
the
code,
but
please
don't
assume
it's
actually
going
to
work
for
you.
B
However,
I
don't
think
I
cover
this
elsewhere,
but
I
am
you
know.
Obviously
we
started
out
with
good
old
mnist
just
to
get
off
the
ground
and
get
the
get
the
basics
working.
We've
moved
on
to
gpt2,
just
because
it's
that's
about
1300
nodes
and
it's
a
good
way
to
kind
of
tease
out
where
you've
accidentally
got
it.
B
You
know
a
super
linear
dependency
on
n
and
you're
you're,
fooling
yourself,
so
it
is
chugging
away
on
gpt2
and
but
again
I'm
not
going
to
talk
about
actual
performance
improvements
for
that.
B
Okay. So from the outside we've tried to make it as innocuous as possible. The first thing is that it's opt-in: nothing will change, other than a few passes here and there that we've had to robustify, and nothing is going to change on the mainline path if you don't opt in. All the existing BYOC calls will still work, and you'll be running the same passes; it'll still be running TVM's built-in FuseOps, all of that stuff.
B
If
you
do
opt-in
and
let's
see
so-
here's
the
oh
oops,
sorry,
I
can't
highlight
text
so
the
relay.collage.enableclash.
True,
that's
what
kind
of
you
know
switches
you
into
the
light,
fantastic
in
your
in
your
past
context,
so
that's
kind
of
first
point
of
customization.
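For concreteness, here is a minimal sketch of what that opt-in looks like. The config key follows the flag named in the talk (its exact spelling comes from the Collage RFC and may differ in main), and mod, params and targets are assumed to be defined elsewhere:

```python
import tvm
from tvm import relay

# Hedged sketch: nothing changes unless you opt in via the pass context.
# The config key is taken from the talk/RFC and may differ in main;
# mod, params and targets are assumed to be defined elsewhere.
with tvm.transform.PassContext(
    opt_level=3,
    config={"relay.collage.enable_collage": True},
):
    lib = relay.build(mod, target=targets, params=params)
```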
B
The second point is that, in order to express to the Collage machinery what it should be considering and exploring, I've piggybacked on targets and introduced a notion of target refinement. So here I'm building up a bunch of targets, and some of these targets are just plain old CUDA targets; however, they've been given an additional attribute, "compiler", which corresponds to the BYOC compiler name. Then I just throw all those targets together and pass them in, and so this is piggybacking on the existing heterogeneous target support machinery.
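A hypothetical sketch of the kind of target list being described. The "compiler" attribute follows the RFC's prototype; treat both the attribute spelling and the dict form as assumptions, since stock targets only validate registered attributes:

```python
import tvm

# A plain CUDA target: partitions assigned here use TVM's own lowering.
cuda = tvm.target.Target("cuda")

# "Refined" CUDA targets: same kind, plus a hypothetical "compiler"
# attribute naming the BYOC toolchain (spelling per the RFC prototype;
# this validates only once that attribute is registered for the kind).
trt = tvm.target.Target({"kind": "cuda", "compiler": "tensorrt"})
cublas = tvm.target.Target({"kind": "cuda", "compiler": "cublas"})

# Everything is thrown together and passed to the build as a list rather
# than a device dictionary, reusing heterogeneous target support.
targets = [cuda, trt, cublas]
```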
B
There are a few changes I had to make, because now this is a list, not a dictionary, but they're all actually pretty minor. And thanks to Matthew, you can now also have compiler=cudnn and compiler=cublas. Okay, and with that, it will trigger a new CollagePartitioner pass which (and this is another difference from the paper) is run very early.
B
It's
in
fact,
it's
run
as
soon
as
we
can
get
away
with,
and
the
reason
for
that
is
this
practitioner
pass
is
going
to
be
kind
of
it's
you
know
reaching
in
and
looking
at
all
the
byoc
patterns
and
so
on
in
order
to
decide
what
all
the
valid
partitionings
are,
and
we
want
to
make
sure
that
those
byoc
patterns
see
the
same
graphs
that
they
currently
do
with
the
existing
partition
for
tool
chain
convention,
and
that
convention
is
obviously
you
do
your
partitioning
before
you
enter
the
main
tvm
build,
and
so
that's
forced
us
to
put
the
collage
practitioner
way
up
front
in
the
rrc.
B
I think we've made the right decision, and we've just moved it to be as early as possible. Note, however, that the paper does its work as a hook inside the existing FuseOps, so that's another difference from how the paper approaches it.
B
Okay. And then the partitioner does its thing: it goes off, does search, does tuning, does all that stuff. Its output is actually no different from the output you currently get using all of the existing machinery: it encodes all of its decisions in exactly the same way they're currently encoded, using primitive functions with "Compiler" attributes, where the bodies of those functions may have composite functions.
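As a reminder, here is a small hand-built sketch of that encoding. The attribute names (Primitive, Compiler, Composite) are the standard BYOC convention; the shapes and the pattern name are purely illustrative:

```python
from tvm import relay

# Inner "composite" function: records which BYOC pattern matched
# (the pattern name and shapes are illustrative).
x = relay.var("x", shape=(1, 32, 28, 28))
w = relay.var("w", shape=(32, 32, 3, 3))
inner = relay.Function([x, w], relay.nn.conv2d(x, w, padding=(1, 1)))
inner = inner.with_attr("Composite", "tensorrt.conv2d")

# Outer "primitive" function: records which BYOC compiler lowers it.
a = relay.var("a", shape=(1, 32, 28, 28))
b = relay.var("b", shape=(32, 32, 3, 3))
outer = relay.Function([a, b], relay.Call(inner, [a, b]))
outer = outer.with_attr("Primitive", 1)
outer = outer.with_attr("Compiler", "tensorrt")
```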
B
Basically,
we've
had
to
make
sure
that
the
compiler,
if
you
simply
take
a
module,
that's
already
been
kind
of
rewritten
to
use
these
primitives
and
so
on.
In
any
form
you
want,
and
you
just
pass,
that
through
tvm,
you
get
exactly
what
you
expressed.
That's
not
quite
the
case
at
the
moment.
There's
a
few
little
glitches,
but
we've
fixed
those,
okay
and
and
so
yeah
after
collage
has
done
its
thing.
We
just
let
compilation
proceed,
there's
no
downstream
changes
and
all
of
the
existing
lowering
dispatch
and
unification.
B
Okay, so that's kind of the outside. On the inside, I didn't want to go too deep, so I thought I'd stay more on the light and fluffy side, and we can just chat about anything that takes people's interest. I should mention I'm working on some additional things which are not in the RFC, and so I need to go back and put some more explanatory text in there, but...
B
...let's just skirt along the surface and see how we do. So, on the inside: what does this mysterious CollagePartitioner pass do? Well, naively you could just try all the partitionings, and given the no-expense-spared approach we have to tuning, maybe it's not completely outrageous that we would almost brute-force our way through. But thankfully we don't need to do that, and so we can bring this back into the realms of practicality.
B
Certainly
as
soon
as
you
start
dealing
with
n
factorial,
you
know
because
a
complexity
class
you
have
to
trend
very
very
carefully,
so
there's
basically
two
main
assumptions.
The
the
first
is
that
we
can
kind
of
rely
on
the
existing
boac
patterns,
along
with
some
very
simple
kind
of
fusion
styles,
in
order
to
kind
of
get
what
I've
been
calling
ideal
partitions
and
we
can
kind
of
recover
those
ideal
partitions
kind
of
independently.
B
So
each
each
potential
back
end
can
recover
its
notion
of
ideal
partitions,
independently
of
all
the
others,
and
we
can
proceed
from
there.
And
so
what
is?
What's
an
ideal
partition,
maybe
I'm
getting
a
little
too.
You
know
hung
up
on
notation
here,
but
so
the
idea
is
that
an
ideal
partition
is
kind
of
like
a
goldilocks
partition.
It's
not
too
large
and
it's
not
too
small
right.
B
So
we
want
it
to
be
as
large
as
possible
because,
let's
say
you
know
the
part.
Let's
say
that
we
on
the
one
hand
we
have
a
partition,
conf
2d
ad
and
another.
We
have
the
partitions
confidenti
and
ad
separately.
B
Well,
obviously,
if
we
don't
explore
the
the
partition
that
has
both
of
those
operations,
we're
missing
out
on,
you
know
the
fusion
opportunities
and
other
optimizations,
which
is
the
whole
point
of
of
this
work,
and
so
we
want
to
make
sure
that,
when
we're
dealing
with
ideal
partitions,
they're
kind
of
large
so
that
we
get
lots
of
opportunities
for
the
various
byoc
backends
to
kind
of
you
know
flex
their
muscles
and
and
trigger
all
the
fusion
and
optimization
that
they
want.
B
But
on
the
other
hand,
we
don't
want
them
to
too
big,
I
mean
because
then
we're
kind
of
you
know
stuck
having
to
explore
this
huge
space.
So
we
so
we
we
do,
we
want
them.
We
don't
want
them
so
large
that
if
we
split
them
you
would
get
much
the
same
execution
time
as,
if
you
you
know,
measured
them
together.
So
in
other
words,
let's
say,
we've
got
two
confides
for
argument's
sake
in
succession.
B
We
could
say:
oh
well,
obviously
the
ideal
partition
is
confidentially
confidentially,
but
the
execution
of
time
of
that
for
a
particular
boioc
target
is
probably
the
same
as
just
having
two
partitions
confidentially
conflicted
and
adding
their
execution
times,
and
that's
because
by
unioning
those
things
we're
not
kind
of
opening
up
any
more
optimization
possibilities.
B
So,
basically,
with
a
by
being
careful
with
these
rules,
we
can
make
sure
that
the
starting
point
for
the
search
is
kind
of
primed
from
partitions
that
are
kind
of
sensible,
that's
kind
of
the
hand
wavy
way
of
saying
it,
and
then
the
second
simplifying
assumption-
and
this
one
is-
is,
you
know
more
suspect
and
is
why
the
paper
explored
using
evolution.
Research
so
we're
assuming
that
when
we
have
two
partitions
that
we're
exploring
that
their
costs
are
additive
and
so
basically
given
to
partitions.
B
...the search is assuming that the cost of running A and B as a single run is the same as the cost of running A plus the cost of running B in isolation, plus a small penalty to account for the fact that, yeah, you had to launch a kernel or make some other call; there's some overhead to doing that call. And that assumption is patently false.
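Spelled out, the additive-cost assumption is roughly the following (the overhead term delta is my notation, not the RFC's):

```latex
\mathrm{cost}(A \cup B) \approx \mathrm{cost}(A) + \mathrm{cost}(B) + \delta_{\mathrm{overhead}}
```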
B
I
mean
we
have
cache
effects
and
all
sorts
of
other
things
that
mean
costs
aren't
additive,
but
nevertheless,
it's
a
simplifying
assumption,
which
means
that
we
can
now
just
use
a
classical
dynamic
programming
approach
to
doing
this
search
and
in
fact
the
rfc
uses
dijkstra
just
because
I'm
trying
to
make
I'm
tr,
I'm
hoping
that
we
don't
have
to
explore
the
whole
space
that
we
can
kind
of.
You
know
once
once
you
get
to
a
particular
point
in
the
search
space
and
you
realize
there's
a
very
expensive
option.
B
Well,
you
don't
need
to
waste
time.
You
know
branching
out
from
there,
in
other
words,
to
use
the
the
classic
shortest
path
terminology,
I'm
hoping
that
the
graph
has
a
low
bloom
factor
and
that
you
can
kind
of
fairly
quickly
kind
of
just
find
your
your
shortest
path,
and
I
think,
from
here
on.
I
have
some
pictures
because
I
spent
all
this
time
drawing
them
for
the
rfc
and
figured
well.
B
...if I spent all that time, we might as well look at them. So, on the inside: obviously we're going to be doing lots and lots of work with subgraphs; this is our core data type. The paper did it this way, and I thought it was a very nice idea. Basically, you assign a post-DFS index to every node, which is already done as part of the index-graph machinery...
B
...that's already inside TVM, as part of the DFPattern machinery. So we assign a unique id to every node, and now we can build a very efficient representation for subgraphs. We're going to have many of them, you know, thousands; I think with GPT-2 we end up with about 4,000 kicking around. So we can represent them very efficiently as bit vectors. And then there's also this whole machinery for partition rules, which I mentioned early on.
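A toy sketch (my own code, not the RFC's) of that bit-vector subgraph representation, using a Python int as the bitset keyed by post-DFS index:

```python
# A subgraph is the set of post-DFS node indices it covers, packed into a
# single Python int used as a bitset (illustrative code, not the RFC's).
class SubGraph:
    def __init__(self, indices=()):
        self.bits = 0
        for i in indices:                  # post-DFS index of each node
            self.bits |= 1 << i

    def disjoint(self, other):
        # Candidates can only be combined if they cover no node in common.
        return self.bits & other.bits == 0

    def union(self, other):
        result = SubGraph()
        result.bits = self.bits | other.bits
        return result

# Example: a conv2d at index 3 and an add at index 4 form candidate {3, 4}.
cand = SubGraph([3, 4])
assert cand.disjoint(SubGraph([5]))
assert not cand.disjoint(SubGraph([4]))
```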
B
So when Collage begins, it looks at the targets, looks for the "compiler" attributes in those targets, and from there it goes off and looks at the BYOC plugins; from that information it imports and builds its own representation of what's partitionable. Then those partition rules can be, if you like, executed on a dataflow graph in order to yield a set of candidate partitions, and a candidate partition is the subgraph plus the target that you're wanting that subgraph to be compiled for. And this slide is just showing you that we actually compose those patterns in order to effect the kind of rules that we're looking for.
B
In
this
case,
it's
looking
like
it's
a
it's
kind
of
more
like
a
cutlass
style
integration,
where
it's
now
building
up
a
whole
set
of
possible
partitions
based
on
df
patterns
that
are
kind
of
pulling
out
the
primitives
and
additional
fusion
combining
rules.
That
kind
of
combine
those
patterns
to
yield
the
ideal
partitions.
B
And
whoops,
and
so
once
we've
done
that
now
we
move
into
actually
doing
the
search
and
the
search
is
done
on
an
implicit
search
graph.
We
don't
actually
materialize
the
whole
graph.
B
That's
that's
not
necessary,
so
that
the
a
node
in
this
search
graph
is
actually
the
bit
vector
representing
all
of
the
nodes
in
the
model
that
you've
already
accounted
for,
and
so
basically,
what
we're
saying
is
every
path
into
a
particular
node
in
the
search
graph
has
by
some
combination
of
candid
of
partitions
has,
if
you
like,
covered
all
of
these
nodes.
So
we've
we've
already
decided
that
somehow
we
know
what
to
do
with
this
subset
of
the
model.
And
now
the
question
is:
what
do
we
do
with
the
rest?
B
And
so
the
the
edges
out
of
these
search
nodes
are
all
the
possible
candidate
partitions
which
can
slot
in
at
that
point
without
intersecting
anything
that
we've
already
accounted
for,
and
obviously
you
don't
want
to
waste
time
kind
of
you
know
if
you,
if
you
apply
one
partition,
rule
and
then
another
partition
rule
well,
that's
the
same
as
applying
them.
The
other
way
around.
So
there's
there's
tricks
in
there
to
make
sure
that
you
don't
waste
time
searching.
You
know
possible
rewrites
that
are
obviously
commutative.
B
Everything
I've
written
here.
Yes,
these
are
just
regular
relay
operators.
The
relay
operator
is
okay,.
C
D
B
And
where
are
foot
star,
I
just
mean
it's
just
my
notation
for
saying
I'm
rewriting
just
this
sub
graph
and
the
star
whatever
whatever
is
inside
the
star
is
not
part
of
the
subgraph
internally.
It's
not
represented
as
these
expressions
it's
represented
with
these
bit
vectors.
B
Right. And so, just using classic Dijkstra, you basically lazily explore this search graph. You start with the starting state, which has no covered nodes; the ending state has every node accounted for. At every node you simply enumerate all of the candidate partitions that can slot in there without violating any of the rules, and you keep track of the cumulative cost of the best path. And with any luck, if there's a low branching factor in your search, the search will narrow down and find a path to the finished state without having to explore the whole space.
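Putting those pieces together, here is a compact sketch of the search loop just described. Every name is illustrative, and cost_of stands in for the measurement-backed cost estimator discussed next:

```python
import heapq

def collage_search(num_nodes, candidates, cost_of):
    """Dijkstra over the implicit search graph (illustrative, not RFC code).

    num_nodes  -- number of dataflow nodes in the model
    candidates -- list of (bits, target): each candidate partition's
                  covered-node bitset plus the backend it would run on
    cost_of    -- callable (bits, target) -> measured cost in seconds
    """
    done = (1 << num_nodes) - 1            # final state: every node covered
    best = {0: 0.0}                        # cheapest known cost per state
    frontier = [(0.0, 0)]                  # (cumulative cost, covered bitset)
    while frontier:
        cost, covered = heapq.heappop(frontier)
        if covered == done:
            return cost                    # first pop of the goal is optimal
        if cost > best.get(covered, float("inf")):
            continue                       # stale queue entry
        for bits, target in candidates:
            if bits & covered:
                continue                   # overlaps nodes already decided
            nxt = covered | bits
            c = cost + cost_of(bits, target)   # the additive-cost assumption
            if c < best.get(nxt, float("inf")):
                best[nxt] = c
                heapq.heappush(frontier, (c, nxt))
    return float("inf")                    # no complete partitioning found
```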
B
We
are
actually
doing
that's
an
excellent
question,
we're
actually
doing
auto
tuning
or
currently
just
auto
tvm
on
the
fly,
and
so
and
that's
you
know
this.
This
is
where
I'm
a
little
worried
because
for
auto
tvm,
it's
not
too
bad,
but
for
the
newer
meta
schedule,
machinery
every
candidate,
tvm
kernel
will
be
treated
as
its
own.
B
You
know
tuning
task,
and
so
we
are
going
to
be
exploring
a
lot
more
of
those
than
tbm
would
just
left
to
its
own
devices,
because
tvm's
fuseops
is
always
eager,
whereas
in
this
world
we
have
many
more
possible,
you
know
candidate,
kernels
to
to
try
and
tune
for,
but
yes
currently
currently
I
I
haven't
made
any
face
distinction
here.
When
you
as
collage
is
searching,
it
may
find
a
particular
candidate.
E
Mark, on that point: will that still be covered by the caching mechanism in the cost estimator? Assuming that, you know, in a subsequent search Collage decides to tune the same operator.
B
Yes
right,
absolutely,
yes,
so
all
of
the
estimate,
the
cost
estimator,
is
obviously
backed
by
a
cache
here
at
octoml
we
have
a
cache
that
has
visibility
across
all
of
the
models
and
all
of
the
targets,
and
so
one
would
hope
that
we
get
a
good
hit
rate
on
that,
but
even
taking
that
aside,
certainly
when
you're
tuning,
you
know
a
lot
of
these
deep
models.
B
Right. So there's an abstract cost-estimator interface which, given an IRModule (well, an IRModule and a target), gives you a double; that's pretty much it. In the prototype there's only one instantiation of that interface: it just runs using the public TVM local runners, and it actually bottoms out into the standard benchmarking machinery that's in Python, so that folks can adjust it as they need to.
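A sketch of that interface; the class and method names are mine, and the local instantiation leans only on standard public APIs (relay.build, GraphModule.benchmark):

```python
from abc import ABC, abstractmethod

import tvm
from tvm import relay
from tvm.contrib import graph_executor

class CostEstimator(ABC):
    """IRModule plus target in, one latency number out (names are mine)."""

    @abstractmethod
    def estimate(self, mod: tvm.IRModule, target: tvm.target.Target) -> float:
        """Return the estimated mean latency in seconds."""

class LocalEstimator(CostEstimator):
    """Build and benchmark locally, like the prototype's one instantiation."""

    def estimate(self, mod, target):
        lib = relay.build(mod, target=target)
        dev = tvm.device(target.kind.name, 0)
        runtime = graph_executor.GraphModule(lib["default"](dev))
        return runtime.benchmark(dev, number=10, repeat=3).mean
```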
B
The caching in the prototype, which I will probably make the default (I'll have to clean it up a bit, because currently it's a little too hard-coded), works as a naive in-memory cache, coupled with a little bit of hackery that I've done to use the standard AutoTVM tuning records as a cache as well.
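Continuing the previous sketch, the naive in-memory cache can be modelled as a wrapper over that interface (the AutoTVM tuning-record reuse he mentions is not shown):

```python
# Assumes CostEstimator from the sketch above.
class CachingEstimator(CostEstimator):
    """Memoize estimates keyed by structural hash of the module plus target."""

    def __init__(self, inner):
        self.inner = inner
        self.memo = {}

    def estimate(self, mod, target):
        key = (tvm.ir.structural_hash(mod), str(target))
        if key not in self.memo:
            self.memo[key] = self.inner.estimate(mod, target)
        return self.memo[key]
```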
B
The net result is: I could check in a kind of cached representation of the AutoTVM tuning for a bunch of examples, but I'm thinking not to try to check in a cache for any of the Collage candidate partitions, because honestly it's pretty cheap to measure those. The most expensive thing in the Collage search is the TVM tuning, not the "let me compile and evaluate how quick this is" for CUTLASS and things like that; that's all pretty quick.
E
I have another question. So after the candidate partitioning is searched, let's say we end up with a better partitioning: will Collage, will this work, consider merging any compiled regions, if that was an original desire of that target?
E
So when you find a better partitioning in the graph, and it's covered by a BYOC target that originally had the desire to merge the compiler regions: after the search is concluded, is this then merged?
B
Yes, there is actually a cleanup pass that merges adjacent partitions. So even though during search I only looked at the little, what I started calling ideal, partitions, if you end up finding, oh yeah, Collage, Collage, TVM... long stretches of little TVM candidate kernels, yeah, they all get joined together. So it's a little bit like what MergeCompilerRegions already does.
B
Yeah. And based on the simplifying assumptions, doing so should be neither here nor there. It might save a little bit of kernel transition time; those are supposed to be pretty small. But if by doing that suddenly, whoa, wait a minute, everything's dramatically faster, well, that means Collage probably should have been searching on larger candidates to begin with.
B
Okay, and I think... I'm not sure I have any more slides... oh yeah, the most important slide. So let me put up some disclaimers, just so that there's no disappointment. As I said, we have to be careful to only look at fairly smallish subgraphs. I think currently I'm at like n equals four; maybe we can push it to n equals six or something.
B
Well, we might never explore that, which means the user may have been better off just running partition-for that toolchain in the first place. So, yeah. Well, I'm still hopeful that won't be a problem, but the proof is in the measurements.
B
Lots
of
people
bring
up
very
legitimately.
What
a
minute
that
you
know
my
particular
byoc
needs
this
particular
layout,
or
it's
only
running
on
this
particular
device.
So
can
you
since
you're
already
searching
over
partitions?
Can
you
extend
that
search
to
also
be
in
terms
of
like
device,
placement
or
layout
or
memory
scope
and
all
the
other
kind
of
choices?
B
And yes, we'd love to do that. Not in this version.
B
There
is
an
approach
to
doing
this,
which
we
could
try
in
a
v
next,
but
for
the
moment,
I'm
just
simply
declaring
sorry,
that's
a
scope,
and
so
that's
this
means
that
for
some
targets
like
nvidia,
you
probably
want
to
first
apply.
You
know
a
global
layout
and
then
enter
the
main
collage
partitioning
and
then
proceed,
and
I
should
mention
that
just
because
collage
is
doing
search
doesn't
mean
all
search
has
to
be
done
by
collage.
B
There's
many
layers
of
choices
to
be
made
and
third
limit
is
yeah,
we're
very
much
in
the
the
auto
tuning
world
here
I
know
there
are
lots
of
folks
who
need
kind
of
a
more
classical
compiler
tool
chain
that
doesn't
involve
waiting
a
few
hours.
Absolutely,
I
think,
at
this
point,
collage
is
not
going
to
be
for
you.
B
Theoretically,
the
cost
estimator
could
be
replaced
with
an
analytical
model,
but
I
think
it's
firmly
in
research
territory
as
to
what
that
analytical
model
could
be,
and
then
final
limitation
is,
we've
tried
as
much
as
possible
to
just
piggyback
directly
on
byoc
they're.
You
know,
because
the
interface
isn't
particularly
firm,
there's
a
lot
of
variation
into
in
how
folks
have
done
it,
and
we've
had
to
make
a
few
adjustments.
B
I'm
taking
it
upon
ourselves
to
make
those
adjustments,
as
as
we
go
without
breaking
backwards
compatibility.
That's
about
all
I
had,
I
feel
like.
I
probably
should
have
paused
more
for
questions.
C
I have one question, actually. Yeah: supposing that we identify a model that has a set of somewhat identical parallel graphs that you could offload, is there a mechanism by which we estimate the latency of running those sections in parallel, if they're...
B
Yeah, yeah. That's a whole other sub-area here of, you know, how flexible... when I say partitioning, what do I mean? Do I mean that, yes, you can explore serial versus parallel? There are also (I don't even talk about this in the RFC) notions of inlining. You know, like: hey, I see a reshape which is then shared 400 times, or in GPT-2 like 38 times or something ridiculous.
B
Should
I
be
exploring
well
do
the
reshape
and
then
share
the
result,
or
should
I
inline
the
reshape
into
all
of
the
consumers?
So
basically
none
of
that
we're
doing
so.
At
this
point
all
we're
searching
over
is
like:
where
does
the
cookie
cutter
go,
and
you
know
what
color
are
the
cookies?
You
know
what
target
does
that
thing
go
to
and
that
that's
it
so
yeah
more
more.
B
...so you may want to say: okay, well, here's my partition, but also do some copying in, and maybe put a, you know, a boundary here to separate. As long as you can do that, and you can estimate the latency by measuring that thing in itself (you don't need to go back and measure the whole thing), as long as you stay within those lines, you're still in this nice, friendly, dynamic programming world, and I think, in due course, we could explore that.
B
I want to have a very firm foundation on the empirical stuff, because I think it's too easy to fool ourselves. But when you start getting into worlds where the choice I make here has a dramatic global effect, and I can only measure the effect of that change by measuring the overall model latency, you're well, well outside of the dynamic programming world. And, yeah.
A
Any other questions, or anything? I guess you can just unmute yourself and ask, or raise your hand, or type your question. There are many options.
B
It was... yeah, no worries.
C
Yeah, should we allow anyone to introduce themselves, if they're new here and want to say hi? That's the only other thing I could think of that we missed from our standard run-through.
F
Hello, yes, I'm new; I'm George. I recently started working with the compilers team within Arm. So, happy to be here, and to put some faces to the community as well.
A
No, probably not. So I guess we can call it for today.
B
Yeah, thanks everyone, and feel free to just send any questions or comments or whatever. I think the RFC is closed, so that's not very convenient, but on the original Discuss post.
C
If you can find it. I took some notes on the conversation today, so not on the presentation part, but I'll post up some brief notes on any of the conversation bullet points, I guess on the RFC thread, or...
C
...yeah, it's hard to take notes and present at the same time. And actually it's worth mentioning here, along the lines of what Leandro was asking about at the beginning of the meeting: one thing that we should get a little bit better at, and I'll try to improve this in the next couple of weeks, is that we'd really like to have someone be a host and someone be a note-taker, and to identify those people early on. And so, you know...
C
...but especially as we're starting to spread out the load of hosting meetings, we'll start asking folks to do that too. So that's another role that, if anyone's interested in, please do sign up.
C
Ping us. So, great.
A
Yeah, wow. And I guess for everyone else, a reminder that we are looking for topics for the agenda to be composed for the next weeks. So if you want to volunteer, you can reach out on Discord, or use the document that you see on the forum as well. I guess that's it for today. Thank you, Mark, for the presentation; thank you, everyone, for attending, and we meet again next week.