From YouTube: 22 - Geometric Deep Learning - Or Litany
Description
Deep Learning for Science School 2019 - Lawrence Berkeley National Lab
Agenda and talk slides are available at: https://dl4sci-school.lbl.gov/agenda
Thanks for the introduction. Should I just use the one I'm wearing? I'll close this. Can everyone hear me? Awesome. All right, so, yeah, thanks for the introduction. I always try to hide the fact that I did my bachelor's in physics, because then people expect I'll know something about physics. But no, you outed me, so feel free to ask questions about that.

So there's some abuse of notation about what geometric deep learning is, and it's actually fun to be in that field, because we can just collect all those things that fall outside traditional deep learning. It's a lot of fun. I like to think about it as: anything that doesn't fit in a regular grid, we'd like to call geometric deep learning. That's general enough, but specific enough at the same time.
So you can write grants and get research funding on these things. I'm currently a postdoc at Facebook AI, and I'm moving to Stanford in September for a second year of postdoc. I'd like this to be open for discussion, because I have many slides, and rather than just going through all of them, if some slides interest you, then we can open things up for discussion. So just feel free to interrupt me with questions, and it's okay if we don't finish them, because they're online.

Starting with 3D: I always like to go back to this paper; it's kind of a classic. I'm not part of this paper, but it took a bunch of images and then rebuilt a 3D model.
I always like to look at this as an example of how we keep representations in an efficient way. There's something a little bit unsatisfying about just keeping all those images, because in the end each of them is just some projection of what's there in 3D. So I always like to start by saying that 3D is just a better representation of the world.

It's also important to understand where we come from when we say 3D, because many times when researchers talked about 3D, we thought of a handful of clean meshes and synthetic objects that artists created, and not for us, actually, for themselves. But then we took those, and only a few, so there were no massive data collections to actually work on. Nowadays, though, something similar to the transition that images have seen is happening.

In the old days, people tried to create models of images using springs and masses and things like that. But then, you know, everyone has a camera in their pocket, so let's use that. I think this is something we're starting to see happen in 3D, so this is the right time to prepare for the 3D revolution. The new iPhone already has a scanner in the front camera that produces a pretty dense point cloud.
If you want to use it, there are apps out there that can already give you that data, so we're almost ready. And it's not just this camera: we have a bunch of sensors, and now they're smaller and cheaper. It's also not only about acquiring the data, because we also need to ask: okay, so we acquired data, what can we do with it? Now we have things like VR and AR devices, so we can enjoy that data.

Let me start by addressing one key property of 3D that's different from 2D. In 2D, you've probably heard a lot this week about different ways to process the data, but somehow everything starts with a convnet, and one reason that's possible is the fact that images are represented on a regular grid, and that's agreed on by everyone. So we know the camera manufacturers are not going to do anything radical next.
A
You
know
the
next
version
of
the
iPhone
will
still
have
an
array,
and
that
means
that
we
can
keep
our
algorithms
still
working
on
arrays.
That's
fine,
they're
gonna
be
2d
they're,
not
gonna,
invent
any.
You
know
some
sort
of
a
weird
thing,
because
that
will
influence
an
entire
complex
pipeline.
If
really,
we
don't
have
that
agreement
all
right
so
because
it's
kind
of
maybe
early
stage
and
also
not
because
it's
early
stage
also
because
it's
not
always
appropriate
to
use
the
same
representation
because
you
have
different
needs
in
there.
A
So
without
this
agreement
we
were
left
with
a
bunch
of
representations.
These
are
the
ones
that
are
that
I'm.
Showing
here
are
probably
the
most
common
ones,
so
right
so
I'll
talk
about
so
one
of
them
is
two
images
right,
like
the
first
slide,
I
show
they're
just
a
bunch
of
images
that
represent
some
underlying
3d
object,
that's
still
valid,
so
we
can
still
use
that.
A
If
not,
then
we
have
point
clouds
and
I'll
talk
about
each
of
them
right,
meshes,
voxels
and
I'll
briefly,
mention
also
level
sets,
maybe
not
so
briefly,
depending
on
your
questions,
all
right,
so
let's
maybe
take
extra
minute
to
maybe
go
through
a
bunch
of
things.
So
it
was
importantly
when
preparing
the
slides
to
kind
of
push
the
message
that
Phoebe
is
not
just
these
guys.
It's
not
always
the
case
where
you
get
a
3d
or
you're
trying
to
output,
something
it
VD
is
when
you
need
3d,
ok
and
what
do
I
mean
by
that?
A
So,
let's
take
this
example
from
you
know.
You
have
an
image
and
let's
say
you
want
to
do
some
kind
of
enhancements,
denoising
depth
of
field
whatever,
and
we
can
think
even
with
you
know,
even
if
your
model,
assuming
you're
doing
your
solving
that
problem
with
with
a
neural
network,
even
if
the
model
runs
in
2d
honor
to
the
input
and
outputs
2d,
it
has
to
understand
something
about
the
3d
or
the
depth
of
the
scene.
A
So
imagine
you
have
like
two
images
and
you
want
to
say:
are
they
the
same
and
again
like
2d
methods,
working
on
images,
doing
metric
learning
from
very
well
using
features
like
color,
but
once
you've
changed
the
view
angle
significantly,
then
maybe
you
can
benefit
from
seeing
many
other
3d
models
of
cars
right.
So
you
can
imagine
some
underlying
3d
model
of
a
car
is
placed
somewhere
in
the
memory
of
the
networks
that
so
that
you
can
utilize
it
a
little
more
towards
the
explicit
side.
A
A
A
That's the key message. And this is a very, very cool work from DeepMind from, I guess, two years ago. What's happening here is that they try to teach the network about the existence of this scene, but the way they do it is by only showing it images from different viewpoints, and then they want the network to be able to generate a new image. So again, the input is 2D and the output is also 2D, but conditioned on the camera direction.

That means that somewhere inside, the model needs to build a 3D understanding of the scene; otherwise, how could it generate it? But what type of 3D model? It's definitely not one of those four or five that I showed in the first slide. We don't know. Oops, that should have been 3D. Of course, when we go towards more explicit 3D, it's clear: here the input is 2D, but the output you're interested in is a 3D point cloud.

Then there are all those applications that start from 3D and try to do something like segmentation, correspondences, object detection; that's what you immediately think of, maybe. And to some extent it's maybe not fair to call this slide "3D" and then say that's "beyond 3D", but going outside the spectrum, we also care about other things, especially graphs: they can be used to represent 3D, but they're more general.
Okay, so let's briefly discuss the first two representations. The first one is multi-view, and the reason I don't want to touch too much on it is that you've heard the entire week about neural networks, so you kind of already know how to solve this thing. The other one is voxels.

So here, let's say the problem is: you take a bunch of images and you want to classify this thing, for example. You basically process each image independently using your CNN, and then essentially you don't care about the fact that it came from a 3D object. The first works that did this type of thing just said: okay, let's take the collection of images, and the only thing we need to care about is how to pool them together.
Right, this is another example, again one of those cases where we're somewhere in between the implicit and the explicit: each image is processed individually, but what matters is the way they're aggregated. And this is something I want to put out there as a general message; it doesn't necessarily have to do with this talk. If you don't care what happens inside the network at all, and you only care about one output, like a classification, then fine: that representation can lie anywhere, in the layer before the features, or the logits, or whatever you want. But if you want to do aggregation, then it may be a smart idea to ask the network to do its processing such that the operation you want to do is easy.

What do I mean by that? For example, here we took a bunch of images and represented each of them in a 3D frame, and the reason we did that is that we wanted some kind of 3D reconstruction. What's difficult is merging two images, because how do you do that? But if those are already 3D models, then it's just the union operator. So the thing I need to ask my network is: give me a representation where it's easy to unify things. That's a general message, but it's also, again, an example of using 2D images to represent an underlying 3D model.
So here it's classification, here it's 3D reconstruction, and that representation can also give you a novel-view render. Voxel grids were probably the first, the most naive generalization from 2D to 3D: you move from pixels to voxels, you just add another dimension. It's also easy in terms of the machinery: except for memory, which we'll discuss, you don't have to invent anything new; you just add one more dimension.

And what's one more dimension if you already work with a very deep per-pixel representation? Each pixel is no longer RGB; we have hundreds and hundreds of features, so we just add one more dimension, everything stays kind of the same, and it works to some extent. The main issue with this, and the reason we're trying to find other solutions, is mainly resolution. Here's one example; there are many such examples you can find online.
If a box is completely free of existing points, then it's just not occupied; if the number of points in that voxel exceeds a certain threshold, it's occupied, and there are many ways to decide whether a voxel is occupied or not according to the threshold. Oh, sorry, so this is the percentage of occupied voxels: this one says that if you use a 32 by 32 by 32 grid, then more or less 10% of the voxels are used, and the rest are zeros, just free space.

The issue is that you need to process this free space. Unless otherwise treated, your network doesn't know what's a zero and what's not a zero. In terms of computation, you're going to input that huge thing where ten percent actually has meaningful values and the rest are zeros, but you still process everything.

So if you start with large voxels and they're completely empty, you can imagine there's no reason to split them, to break them apart into smaller voxels; then you can do this iterative process. And this one goes one step further and computes objects in a hierarchical way, kind of like onion layers. The techniques are interesting, but in my mind, whatever is not simple enough is just not going to stay.

This community likes simple things; the computer wants to do simple operations because they can scale up very quickly. These are very interesting ideas, but I don't see a lot of adoption of those techniques in the community. So that's it for voxels. Does anyone have questions on those two, multi-view or voxels? All right, let's go into the more interesting stuff.
Hopefully it's convincing that voxels are not the best way to go, especially if you want a good representation of an object. Meshes: graphics people have used meshes for so long exactly because of this reason; you just have an efficient way to represent a single surface. And point clouds:

Okay, just a simple example: if you sampled my fingers, a point here and a point here can be as close as a point here and a point here; the difference is that you don't want to connect the first two, but you do want to connect the other two. Those two are situated on the same surface, and these are not.

In order to understand that... right, so in terms of the representation: yes, whether you can or cannot fold or unfold a 2D plane onto the entire 3D surface has to do with the topology of the 3D shape you're trying to represent, but yes, locally at least, you can build those atlases and do that.
So let's start with point clouds. I think this is something the community has cared about a lot in recent years, and I'd argue it's mainly the driving force of, no pun intended, autonomous driving. So what are the challenges when we want to work with point clouds, as opposed to, say, image pixels? First, we don't want to preprocess them. One thing we could do is convert a point cloud to a voxel grid: you just place your grid around the object and then you're back at the voxel grid, so we have a solution, great. But you want to go raw, because you don't want to invent data, you don't want to lose resolution, you want to be efficient, and you don't want to process zeros.
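(If you do want the voxel route, the conversion is straightforward. Here is a minimal NumPy sketch of turning a point cloud into a binary occupancy grid; the grid size and the one-point threshold are arbitrary choices for illustration, not anything prescribed in the talk.)

```python
import numpy as np

def voxelize(points, grid_size=32, min_points=1):
    """Convert an (N, 3) point cloud into a binary occupancy grid.

    A cell counts as occupied once it contains at least `min_points`
    points, mirroring the thresholding discussed in the talk.
    """
    # Normalize the cloud into the unit cube so the grid covers the object.
    mins, maxs = points.min(axis=0), points.max(axis=0)
    normed = (points - mins) / (maxs - mins + 1e-9)

    # Map each point to an integer cell index in [0, grid_size - 1].
    idx = np.clip((normed * grid_size).astype(int), 0, grid_size - 1)

    # Count points per cell, then threshold.
    counts = np.zeros((grid_size,) * 3, dtype=int)
    np.add.at(counts, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)
    return counts >= min_points

pts = np.random.randn(2048, 3)
grid = voxelize(pts)
print(grid.mean())  # fraction of occupied voxels; often around 10% at 32^3
```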
So that's the unstructured part. Another thing is that point sets are sets: they don't have a defined order, and that's also something we need to address. If I put down a grid, I essentially hint that there is an up direction and that there's an order in which I can process those points; if I wanted to push them through a recurrent neural network and treat them like text input or audio input, that has an order.

This is just a set: whether I permute it or not shouldn't matter to the algorithm. So how do we handle this? A few years back the PointNet paper came out; I think it was the original paper that started the whole rush for point cloud networks. It starts with a very, very simple observation: if you want to be invariant to the order of the points, then essentially you need to work only with symmetric functions.
A symmetric function is essentially what I just said: a function that doesn't care about the order of the input; it produces the same output. Some easy examples of such functions: if I give you a set of values and I just ask you what the maximum value is out of all those values, then you don't care about the order in which I gave them to you.

Another thing you can do is just combine them, add them together; and there are many such functions. That's exactly what PointNet proposes to do. You start with your n by 3 representation of a point cloud, since you have n points in 3D, and the first thing the network does, if you can see...
A
I,
don't
have
a
pointer,
but
the
first
thing
that
the
network
does
is
processes
each
point
independently.
All
right,
that's
the
other.
That's
the
other
thing
that
you
can
do
that.
Keeps
you
equivalent
to
the
order
right.
So
one
option
is
to
you
know,
make
an
operation
like
adding
all
the
points
together
and
then
your
environment,
because
you
don't
care
about
the
permutation,
but
another
thing
you
can
do
is
be
equivalent.
So
you
do
the
operation
point
wise
right
and
if
you
per
moved,
then
it
changes,
but
it
changes
in
the
same
way.
A
So
if
you
first
do
the
operator
and
then
permute
or
if
you
first
permute
and
then
do
the
operation,
you
get
the
same
result,
and
so
that's
another
kind
of
a
legal
operator
that
you
can
do
so.
You
can
take
the
endpoints
and
process
each
of
them
individually,
say
by
passing
them
through
a
multi-layer
infrastructure
on
each
of
them
separately.
Right.
But
now
you
can
introduce
one
of
those
symmetric
functions,
say
the
the
max
per
channel
max
right.
So
now
you
let's
say
you
went
from
n
by
3
and
I'm
skipping.
A
A
Sorry,
oh,
it's
just
the
number
of
channels
at
each
layer
of
the
MLP
right,
so
you
start
with
3
and
then
you
have
like
essentially
a
matrix
multiplying
from
3
to
16
or
whatever.
So
this
will
be
like
take
your
64
to
124
1028,
yes,
just
on
design
choice,
yeah,
right,
yeah
and
there's
some
some
interesting
arm
observation
in
the
paper
about.
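To make the architecture concrete, here is a minimal PyTorch-style sketch of the idea just described: a shared per-point MLP (order-equivariant) followed by a per-channel max pool (order-invariant). The channel widths follow the 64/128/1024 progression mentioned above; everything else, like the single-layer classifier head, is a simplification, not the exact PointNet architecture.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Shared per-point MLP + symmetric max pool, in the spirit of PointNet."""
    def __init__(self, num_classes=10):
        super().__init__()
        # The same weights are applied to every point independently.
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 1024), nn.ReLU(),
        )
        self.classifier = nn.Linear(1024, num_classes)

    def forward(self, pts):                    # pts: (B, N, 3)
        feats = self.point_mlp(pts)            # (B, N, 1024), equivariant
        global_feat = feats.max(dim=1).values  # (B, 1024), invariant
        return self.classifier(global_feat)

# Permuting the input points leaves the output unchanged:
net = TinyPointNet()
x = torch.randn(2, 1024, 3)
perm = torch.randperm(1024)
assert torch.allclose(net(x), net(x[:, perm]), atol=1e-5)
```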
The property of this thing essentially says that you don't lose a lot: you can be as close as you like to the function you actually want to represent while still being order-invariant. And maybe a little intuition about that, because when I first read the paper it was kind of puzzling to me why it's even useful to process each point individually. What's the point? What is a single point really going to tell you? If it doesn't communicate with any of the other points, what kind of information can it hold, even if you take the three dimensions and represent them in 1024?

One interesting intuition behind that is this simple example; after all, this is just choosing another parametric way to represent the point cloud, and here's the simplest example. Say that what the network is actually doing is effectively placing a grid around the point cloud. How can it do that? If each point is a point in space, then the network can simply say: okay, I have 1024 values to represent that point, so I can think of a 3D grid with indicator functions, and if the point XYZ is situated inside one of those boxes, I'm going to place a 1 at the corresponding position among the 1024 dimensions and 0 elsewhere.
I'm not saying that's what the network is doing; I'm saying that's one trivial way for a representation to take 3 coordinates and just represent them differently in a thousand dimensions, but in a way that now gives a meaningful representation of the shape: because if we have that, then just doing a max pool will give you exactly the voxelized shape back. Does that make sense?
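Here is a tiny NumPy demo of that intuition, assuming the per-point feature really is a one-hot indicator of the point's grid cell (a hypothetical feature, just for illustration): the symmetric max over points then recovers exactly the occupancy grid.

```python
import numpy as np

pts = np.random.rand(100, 3)      # points in the unit cube
cells = 8                         # 8^3 = 512 "feature" dimensions

# Hypothetical per-point feature: one-hot of the cell the point falls in.
idx = np.clip((pts * cells).astype(int), 0, cells - 1)
flat = np.ravel_multi_index((idx[:, 0], idx[:, 1], idx[:, 2]), (cells,) * 3)
features = np.eye(cells ** 3)[flat]          # (100, 512)

# Max-pooling this symmetric feature over points = voxel occupancy grid.
occupancy = features.max(axis=0).reshape((cells,) * 3)
```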
So that's one way that at least I could get my head around why it should even work. Now, allowing the network to train and do something fancier, you can think of things like: hey, maybe, similar to what people have done in dictionary learning and techniques like that, maybe there are more recurring structures in 3D. For example, if I see a bunch of points, it's wasteful to represent each of them in its own voxel; instead, I can think of them as multiple pieces of evidence for the existence of a plane. So now I'm not going to waste, whatever, 15 dimensions out of my 1024 to represent that plane; I'm going to just have one indicator function for a plane at a certain angle and a certain location.
You can also play some tricks with a transformation network: a first network that sees the point cloud and decides, okay, I'm going to rotate it by a certain amount. They added one both to handle the input and also to handle the features later on. The first one is very intuitive; the other one less so, but it still seems to help.

Another question you can ask is: okay, that global feature allows me to answer global questions, like what is that shape; I can do classification. But what if I care about going back to the original resolution and saying something about each point, maybe which part of the shape that point belongs to? To do that, they use a simple trick: they take the global feature and concatenate it to each of the individual points. So they go back to the original n points, or to some intermediate stage where there were, say, 64 dimensions, and they just concatenate the same feature over and over again.

So this is a duplicate of the same vector, but you can think of each row here as now having some information about the individual point and about all the others. That already allows communication between each point and the rest of the shape, if you will. And if you keep that resolution, you can keep processing each point individually, but now it's aware of the existence of all the other points, so you can get back results that are meaningful.
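A sketch of that concatenation trick, with made-up shapes for illustration: the single global vector is tiled once per point and appended to each point's intermediate feature.

```python
import torch

N, c_point, c_global = 1024, 64, 1024
point_feats = torch.randn(N, c_point)   # per-point intermediate features
global_feat = torch.randn(c_global)     # output of the global max pool

# Tile the one global vector per point and concatenate, so every
# point "sees" a summary of the whole shape.
tiled = global_feat.expand(N, c_global)              # (N, 1024), no copy
per_point = torch.cat([point_feats, tiled], dim=1)   # (N, 64 + 1024)
# per_point then feeds further shared per-point MLPs for part labels.
```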
For example, you can do part segmentation. First, any questions about those additions? Yeah. So for the task of object classification, as the first work that's not using voxel grids and not image-based, it was already impressive in terms of the results it got. One thing you learn the hard way is that whenever you want to invent something new, you have to fight with things that are very well engineered.

It's very hard to push a new idea to be state of the art. So you either find a reviewer that's willing to accept that novelty can come at the expense of maybe sometimes being the best, or you just have to sit on it for a year before you can push it. Sorry for using the stage for self-frustration.

And then, as I said: this shows classification, essentially what you can do once you have a global representation, and this shows what you can do once you've added that global representation and kept processing each point individually. Now we can do things like part segmentation.
This is also a cool experiment they show in the paper that gives some very nice intuition on what the network is learning, and it has to do with the specific type of global operator they chose, namely the max operator. If you remember, the symmetric function at the global pooling stage that they chose to use is a per-feature max operator. So let's say point number one screams the loudest at feature number fifteen; then that feature is what holds after the global pool.

One experiment you can do: start with your model and the original point cloud you provided, say a thousand points. Now, some points never screamed the loudest, so they're just not important for the global operation. Let's remove those; we can easily detect them and remove them. And then they ask: okay, so what are we left with?

Can we understand what the model has learned in terms of some internal representation? If you throw those points out, you get a set of what they refer to as, I think, the critical points, and the critical point set shows you that it at least makes some sense: you get the boundaries of the object, maybe the important parts that hint at what the thing is, because that was useful for the classification task. So that's one experiment; you can also do the opposite.
You can add points, as long as you add points that don't change the global pooling, namely the representation of the model. You can think of it as a way to do some kind of densification of the cloud, but I also like to think of it as giving you the minimal and maximal boundaries of what you're allowed to do with the object you were given without changing its meaning, the way the network thinks of it. I thought that gives a little intuition.

The way I think of this, though maybe there's a deeper difference here, is that in this case the object is your scene and the parts are your objects. And definitely there's a different behavior. For example, in image segmentation, maybe networks learn to do some kind of composition: if you detected people and cars and animals, they repeat at different places in different images, so you can learn to detect them.

Here, if you've seen a wheel of a skateboard, maybe that doesn't mean you're able to use that later on different models. It's not like a pedestrian in one scene, who will also be a pedestrian in a different scene: it's a part, but it's unique to that object. At the same time, there's much more context, because if you've seen a pedestrian in one image, it doesn't necessarily mean there will be more; but if you just detected, I don't know, a skateboard part and one of the wheels, then you know there should be more wheels. So you do have maybe more context when you're focused on a single object. It will not generalize to objects outside the class, but in terms of variability within the class, it generalizes.
A
If
people
here
like
care
specifically
about
parts
of
3d,
object
and
there's
a
recently,
a
very
large
collection
and
I
mean
ping
me
out
I'll
give
you
the
I'll,
give
you
the
links
right
so
like
4d
3d
in
in
time.
Yes,
yeah,
so
III
don't
show
here,
but
there
recently
is
a
is
a
technique
that
tries
to
focus
specifically
on
on
that
I
think
it's
called
a
Minkowski,
sparse,
convolution
Network
yeah
yeah,
following
like
the
other.
A
You
know
course
key
time,
yeah
yeah,
so
you
can
check
that
out
its
forum,
Silvio's
group
in
Stanford
and
they're,
specifically
yeah
I,
think
one
of
the
problem
is
the
datasets.
So
now
you
need
moving
dynamic
scenes
to
work
on
there's
some
recent
releases
I
forget
by
which
company,
but
there's
some
recent
releases
of
that,
but
not
as
many
as
you'd
find
otherwise.
A
A
A
So we could later see which features were very useful. We had to do some kind of segmentation of an indoor scene, and what we saw is that height almost immediately reveals desks: they always tend to be at the same height, so that's a very good hint.

Actually, one of the advantages of working in 3D is that you have the actual absolute sizes, as opposed to images, where depending on how far you are from the object it could scale; here you have the absolute sizes, so you can use that. Yes? Oh, that's a very good point; okay, I'll get back to that.

No, no, I think they were annotated on reconstructed meshes of those scenes, and then for the purpose of this they were sampled again to be point clouds. So if it's reconstructed as meshes and there's plane detection, then there are pretty easy tools out there where you can just kind of paint on top of the objects. It's not a fun job, don't get me wrong; you still have to rotate the surfaces and paint all over them.
It's just, in some sense, too easy. How do you generate a cluttered room with, you know, a bed that has clothes thrown on top of it? You want what they have in the COCO dataset, the image equivalent in 3D. So I think it very quickly becomes too easy to work on synthetic data in 3D for those tasks. But why do you say you want the same number of points? Because of the architecture I showed, right.

So, okay, there are a couple of answers to this, depending on what you care about. If your data came from the same scanner and it's similar scenes, then you can trust that, for example, density is not a property you need to worry about; it's pretty much constant. But sometimes you scan just a few points and sometimes lots of points.
Then you can either normalize everything to a maximal number of points and pad, or you can work on parts of the scene; if they're far enough apart, then maybe losing context doesn't hurt you so much. Or you can find a clever way to encode it. For example, what they did in the Minkowski network, I think, is that the representation is similar to what you have in sparse tensor representations: you just add, for each point, a label of where it is in the batch.

So you can imagine that all your points from all the scenes in the entire batch are just concatenated into one long vector, but you have the label for each point, so you know which scene in the batch it belongs to. That's just a practical solution, and it exists as well.
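A minimal sketch of that batching scheme, in the spirit of sparse-tensor coordinate formats; the exact column layout here is an assumption for illustration.

```python
import numpy as np

# Two scenes with different numbers of points.
scene_a = np.random.rand(1000, 3)
scene_b = np.random.rand(650, 3)

# Concatenate everything into one long array, and keep, per point,
# the index of the batch element (scene) it came from.
coords = np.concatenate([scene_a, scene_b], axis=0)            # (1650, 3)
batch_idx = np.concatenate([np.zeros(len(scene_a), dtype=int),
                            np.ones(len(scene_b), dtype=int)])

# Columns: [batch_index, x, y, z]; no padding to a fixed N is needed.
batched = np.column_stack([batch_idx, coords])
```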
As for different densities, that's a different problem, I'd say, because then you actually need a model that knows how to handle different densities, that will learn that type of thing. If it's RGB-D, you have a single viewpoint of a scene and you have depth per point; then it's still a regular grid in 2D, but with a depth per point, so you probably have similar density no matter how far the object is. But if it's a lidar that scans like this, you'll probably have a denser sampling at the nearby points than at the faraway points, so mixing the two together in one model is going to be very tricky.

Another option is to first have some kind of an encoder and then have your image sample from that latent representation; that's the important bit. And here, let's maybe skip this for one second, because what I want to discuss is: let's say you generated a point cloud. How do you know if it's a good point cloud? You need a way to measure, a way of measuring the loss between the sets. Again, if we were in the case of images, you could just place one image on top of the other and measure their differences, although you probably heard this week that this is also not always the best technique; but at least we have that as something to start with. But what do we do with sets?
We want to do this type of assignment. What is usually done is to use what's called the Chamfer distance, and a simple way to understand it is by visualization. Essentially, for each point you pick its closest point on the target, and you also apply the symmetric operation: for each point on the target, you pick its closest point on the source, and you're essentially trying to minimize both distances.
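Here is a minimal NumPy sketch of the (squared) Chamfer distance just described; real pipelines use batched GPU implementations, but the logic is the same.

```python
import numpy as np

def chamfer(src, tgt):
    """Symmetric Chamfer distance between point sets src (N,3) and tgt (M,3)."""
    # Pairwise squared distances, shape (N, M).
    d2 = ((src[:, None, :] - tgt[None, :, :]) ** 2).sum(-1)
    # Each source point to its nearest target, and vice versa.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```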
That seems simple enough; it's an expression you can differentiate your entire network through, and it works very well in practice. This is one of the important enablers of all those generative models, because once you've generated the chair, okay, let's see if I can compare it to what I started with, if it's a VAE for instance. There are other cool things you can do once you've generated some kind of a point cloud.

One application that goes more into the regime of entire scenes is object detection, and I'm going to touch upon that RGB question very soon. In this work we had a point cloud as input and we were trying to detect the locations of 3D objects. Let's assume we can all agree that the representation of a detected object is its center and its bounding box; let's not open a discussion about what a better representation would be. Does that make sense?
A
And
essentially
you
know,
one
thing
you
could
do
is
just
go
directly
from
the
points
into
those
things,
but
that
doesn't
seem
to
work
that
well
and
in
fact
like.
Instead
of
the
art
methods
they
currently
were
based
on
either
you
first
detect
the
objects
in
2d,
and
then
you
search
along
3d,
and
that
requires
RGB
or
you
do
everything
in
a
voxel
grid,
and
what
we
found
is
that
one
reason-
and
that
has
to
do
with
you
know
with
scanning
artifacts-
is
that
that
is
hard
for
you
to
do
object.
A
Detection
is
that
for
object,
detection
you
need
to
do
you
need
to
pass
through
a
stage
of
proposal
that
comes
from
a
point
inside
the
object
right.
So
if
you
I
think
yesterday,
you
probably
heard
about
object
detection
in
2d,
you
probably
were
told
that
you
know
steadily.
Art
methods
they
currently
proposed
from
a
pixel
and
a
pixel
belongs
to
the
object
is
something
you
usually
have.
If
you
have
an
object
in
a
scene
that
sorry
the
center
of
the
bounding
box
is
one
of
the
pixels
you're
observing.
A
That's
not
necessarily
the
case
in
in
3d.
Point
clouds
right,
for
example,
look
at
this
look
at
this
scan
of
this
scene
right.
So
in
many
cases
the
center
of
the
bounding
box
of
the
object
is
not
a
point
in
the
scene
right.
Even
if
you
have
all
the
points
of
the
object
the
center
it
doesn't
have
to
be
on.
A
The
is
not
necessarily
a
part
of
the
scan
right
just
floating
in
mid-air,
and
that's
like
kind
of
like
one
of
the
differences
and
and
maybe
when
I
said
earlier,
we're
not
using
voxels
because
they
have
too
many
zeros.
Sometimes
the
zeros
are
useful
right
because
you
want
to
hang
some
information
there,
and
so
how
do
you
deal
with?
How
do
you
cope
with
that?
In the case of point clouds, one option, and the reason I'm bringing this up is that until now I talked about points as evidence of existence: if a point is somewhere in space, that means there is an object there. But that's not necessarily the only choice. What we did here, for example, is we just invented new virtual points, and you can invent any meaning you want for those points.

We took the scene points, a subsample of them, and we allowed each scene point to vote for where it thinks the center of the object is. That created a bunch of new virtual points: they float in space, they have a 3D location, but the information they carry is where each point thought the center of an object is. The reason this is useful is that now you can look at those clusters of points and think: okay, this is a great meeting place for all those points. It's not a point I sampled, I couldn't process it in the beginning, but now I created it, so I have a place in space to hang all that information, and also a meeting place for all those points to discuss with each other and jointly agree on what's probably the more precise location, and on other features, like the bounding box dimensions of the object.
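A rough sketch of that voting step, loosely in the spirit of the work described; the offset network and the clustering comment are stand-ins, assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Hypothetical per-seed regressor: predicts an offset from a seed point
# to the center of the object it belongs to.
offset_net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 3))

seeds = torch.randn(256, 3)        # subsampled scene points ("seeds")
votes = seeds + offset_net(seeds)  # virtual points: votes for object centers

# Votes from points on the same object should cluster near its center;
# a proposal stage then aggregates each cluster into a box prediction.
```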
Instead of proposing a bounding box directly, each point in the scene tries to find some kind of feature representation of what it thinks the object looks like, and then uses a kind of generative way to represent that object. So it thinks: okay, here's that point on that object, and instead of proposing a box, I'm going to propose a bunch of points, hopefully close enough to where the object is.

This is just from the experience of myself, colleagues, and other researchers I've talked with: there's a huge gap now in 3D in how to use RGB to do proper object detection, say; it's not clear.
A
When
we
try
to
also
add
RGB,
we
couldn't
see
significant
improvement
and
one
reason
to
that
was
that
there's
not
a
whole
lot
of
data
here
and
RGB
sometimes
have
like
a
very
good
clue
on
where
the
object
is
so
you
can
over
fit
very
quick
another.
Another
explanation
is
the
difference
in
resolution.
So
in
2d
images
you
have
like
a
very
dense
resolution,
but
the
3d
point
cloud
is
usually
has
a
much
much
lower
resolution
and
also
we
often
subsample.
A
So
we
have
less
resolution
here
and
then
how
do
you
take
the
high
resolution,
information
from
the
image
and
push
it
into
the
low
resolution?
Point
cloud
is
another
question
writing
to
find
a
good
way
to
do
that.
A
third
explanation
I
think
all
of
them
coexist.
It's
not
like
that.
One
thing
you
know
is
true,
and
another
is
false.
A
Clinician
is
come
comes
from,
say
you
wanna
resolve
the
problem
of
fine-tuning,
so
you
say
of
a
story
of
overfitting
and
you
say:
okay,
let's,
let's
pre
train
the
network
on
something,
then
here's
like
a
very
general
question,
we're
trying
to
ask
what
is
a
meaningful
task
to
train
a
2d
image
on
so
that
the
clue
you'll
have
is
a
geometric.
You
understand
what
I'm
saying
so.
Essentially
there
works
showing,
for
example,
that
if
you
let's
say
you
want
to
train
you're,
gonna,
pre,
train
on
calcification,
say
I,
don't
care!
A
Give
me
two
images
from
imagenet
I'm
gonna
train
for
classification.
Is
that
going
to
be
helpful
to
solve
a
3d
detection
problem?
It's
a
good
question
right.
So
previous
works
show
that
these
networks
for
classifications,
they
care
a
whole
lot
about
a
whole
lot
more
about
texture
than
they
care
about
geometry.
I.
Think
one
example
was:
if
you
take
you
know
an
elephant
and
just
replace
this
texture
with
the
texture
of
a
giraffe.
Then
the
network
will
still
think
it
would
will
think
it's
a
giraffe
hinting
that
it
ignores
the
overall
geometry
of
the
same.
A
So,
in
a
case
where
the
task
you're
actually
interested
in
is
geometric
in
nature,
you
probably
don't
want
to
solve
classification
or
not
in
the
usual
way.
So
I
think
it
opens
up
a
very
interesting
kind
of
area
for
research
of
what
is
a
proper
geometric
task
to
perform
on
to
the
images
so
that
the
features
you
learn
are
transferable
to
other,
like
to
other,
to
3d
or
or
in
general,
are
useful
right,
so
I
think
I
think
the
problem
we
saw
is
that
it
improves
it
too
much
right.
A
So
it
learns
to
kind
of
to
ignore
the
3d
information,
or
you
know
very
quickly-
over
fits
your
train
set,
so
it
doesn't
generalize
to
your
test
set.
But
if
you
have
tons
and
tons
of
data
to
train
on
then
then
yeah,
then
that's
on
the
RGB
is
used,
and
maybe
I
should
have
also
said
that
just
adding
RGB
is
very,
very
simple,
because
until
now
we
talked
about
the
input
to
the
network
is
n
by
3.
But you can think of each point as carrying some RGB information, or some other handcrafted descriptor you can compute from it; then it's just n by 9 or whatever.

No, so it is annotated: the bounding box is given and annotated, so we can definitely supervise for the centroid. Even if it's not a point in the scene, it's the centroid of the bounding box of the object; the tight bounding box of the object is what we're after, plus the dimensions and orientation.

Yeah, but it's true that, for example, in one of those datasets they went the extra mile and gave us the supervision of the amodal box, namely both the seen and unseen parts of the model. So if you have, for example, just the back of a chair here, then the supervision for the box will be the entire chair, and that you can do from the image, for example.

You're not just given that in the other dataset we worked on; there the boxes are tight boxes. That one was produced from a fused scan of indoor scenes, so they walked around and sampled things, and for, say, a refrigerator, you only have the front side of the fridge; you never see the whole depth of the thing, and then you have a truncated box. That's very complicated, because now you have to tell your network what the ground truth is.
Ten minutes left? 20 minutes, 25. Okay. So probably the natural transition from point clouds to meshes would be to say that until now we only cared about the points, and now we also care about how the points are connected. Let's start with the case where you know the connectivity; that's simple: someone told you what the connectivity between the points is. But maybe, to push the understanding one step further:

It's not exactly the same, because, for example, if you want to represent the stage I'm standing on with points, you need a whole bunch of points; but if you have meshes, those triangles, then you can just take a few points and connect them, and that's enough. So you can go from a very dense point cloud to a mesh, that's great, but the most common case would be that you're given a mesh; namely, it's a more efficient representation of what's already there.
That means the shapes need to deform, and this here shows that if you keep using the same filters you learned on one model, but now you've deformed it a little bit, they're not going to be useful anymore. But if you can learn filters that kind of bend along, or deform along with your representation, that's more useful: you can reuse them. So again, a demonstration of what is intrinsic.

So where do we have known connectivity? Actually, in many cases. I work a lot on non-rigid 3D shapes, human shapes; we have very good models for those, we have decent meshes. And also social graphs: some companies have large social graphs and they know how people are connected. So what do we want out of those filters?

We want the same thing we have in 2D: we want convolutional filters, and when I say we want convolutional filters, I actually mean we want to do weight sharing; that's the key property here. How do we share the weights from one location to another location? There's lots of recurrence happening and we want to utilize it.

We want multiple layers, because deeper networks are better, and we again want some property of locality; I'll say something about that. I thought about how to present this topic, because it has some history, and maybe you, as physicists, will appreciate it more.
So these could again be just ones and zeros, depending on whether two points are connected or not, but they could be more meaningful. What we lack here that we do have in images is some kind of ordering: if you have a filter, you can place it on top of your image and you know which weight multiplies which image location. Here you don't have that, so you again resort to some kind of symmetric aggregation operator, like max or sum. So essentially what you have is this type of symmetric operator: a layer that consumes two vertices and an edge, performs the symmetric operation on top of them, and then you can follow with some kind of a nonlinear function.

So what did we say here? Essentially, we said something super simple: we want the output of a conv layer at that point. What we're going to do is place a filter on, say, the one-ring of that vertex; we know the adjacency matrix of the graph, so we can just combine all the features, say by averaging. We combine all the features from nearby points, push them through some kind of a nonlinear computation, and that's our new value for the next stage. And if we have a bunch of those filters, then we can represent things that are a little bit more complicated.
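A minimal sketch of that baseline layer: average the features over each vertex's one-ring (read off the adjacency matrix), combine with the vertex's own feature, and push through a learned nonlinearity. Dense adjacency is an assumption for readability; real implementations use sparse operations.

```python
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    """Mean-over-neighbors aggregation followed by a shared nonlinear map."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(2 * in_dim, out_dim)

    def forward(self, x, adj):
        # x: (V, in_dim) vertex features; adj: (V, V) 0/1 adjacency matrix.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neigh = (adj @ x) / deg   # symmetric: mean over the one-ring
        # Each vertex is combined with its aggregated neighborhood.
        return torch.relu(self.lin(torch.cat([x, neigh], dim=1)))
```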
If you follow recent literature on graph neural networks, basically all methods use that as the baseline and then start complicating things from that point on, and the complications can come in many forms. For example, you can think: maybe I want to attend to some points more than others, so I can think about how I weight an edge between two points.

And this is where people realized, I can just say a few words about it, that it's very hard to define the shift operator on this structure, differently from a patch whose position you can change in a very consistent way. If you have a graph, how do you go from one point to the other?

The graph Laplacian allows you to take the signal and represent it as a bunch of coefficients in a Fourier transform, and then if you also represent your filters as a bunch of coefficients in the Fourier space, you just need to learn those coefficients, multiply them, and then resynthesize the function if you want to go back to the primal. And you do have to go back: the non-linearity is introduced in the primal space, so you have to go back and do the non-linearity there. So, setting aside computational complexity for a second, you have a solution.
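In symbols (a standard formulation, not tied to any one paper): with the graph Laplacian eigendecomposition $L = \Phi \Lambda \Phi^\top$, where the eigenvectors $\Phi$ play the role of the Fourier basis, spectral filtering of a vertex signal $f$ by a filter $g$ is

$$f \ast g = \Phi \, \hat{g}(\Lambda) \, \Phi^\top f, \qquad \hat{g}(\Lambda) = \mathrm{diag}\big(\hat{g}(\lambda_1), \dots, \hat{g}(\lambda_n)\big),$$

where the diagonal entries are the learned filter coefficients: $\Phi^\top f$ is the forward transform, the diagonal multiplication filters in the spectral domain, and applying $\Phi$ resynthesizes back to the primal domain.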
One problem with this was that people saw it's very sensitive to the graph: the graph Laplacian is unique per graph, different from the Fourier transform in 2D, which is the same for all images out there. In a graph it depends on the graph, and the filters that people learned weren't transferable to other graphs. That's the property of locality I was mentioning earlier, and one way they solved it in 3D was to define local neighborhoods on a 3D mesh.

Oh, sorry, or you can do a symmetric operation that also ignores orientation. But if you're lucky and you're on a mesh, then you can actually use properties like the largest-curvature direction to guide you as north. So now you can rotate all patches to have a consistent representation of the patches, and you can go back to learning those features that you like in 2D, but now on a mesh: you can translate them between different positions. And there are different ways to measure the distance between a point and the patch you're extracting or proposing, and that addition was very helpful in practice. This is still a symmetric operator, like the one I showed here, but this one already gives you a principal direction, so you can actually use consistent patch representations to learn more: each filter can hold more information and can distinguish between the hand oriented like this or oriented like that.
So, for example, let's imagine you have a graph representation of a mesh of a person, and then suddenly something like a topological change happens. Say you computed some filters on this shape, but now the hands are touching, so there are edges connecting those points, and you don't know that they're separate. You changed the topology, you changed something locally, but it influenced the entire global structure.

It depends on how you think of it. If you think of it in the primal space, then being local is just being local: a patch that looks no further than one or two hops. If you look in the Fourier space, you think of very smooth functions. If you try to do that, you can do that on images as well, and then there shouldn't be a problem.

The problem is that when you have a graph or a shape, you have to compute a graph Laplacian, and that is different depending on the input; it changes with the input. So you need filters that you can reliably transfer between different graphs.

So if you try to do partial matching, let's say I scan you from the front side and I want to find correspondences between each point on that graph and a model of a human shape; those filters will also be very sensitive to the graph change. So, what was the first part of what you said? Should I think of them as what?
Yes, right. I think I could return the question to you, because you're the one building the graph. Here, in the case where we assume the graph is given, it's up to the user to decide what's an edge, or what's the meaning of an edge. If you decided to take an image and build a graph from that image according to spatial proximity, that's one option; if you chose to build it according to color proximity, that's another option, and these will be different graphs.

You can mix the two and do kind of a bilateral thing, or whatever. And if you're not given the graph... let me give a couple of examples, and then I'll talk about when you're not given the graph and how that's different. These techniques I'm showing have been used mostly in shape matching applications, where you want to find correspondences between each point, say, on this graph and the object after it has been deformed.

This one I just had to show because we had such a cool visualization; we actually had an artist friend who generated those meshes. This was kind of an example of how you do that in an unsupervised fashion, but I don't have to go into that. So partiality, as I started to explain, is another place where you really saw a difference between using the Fourier-style graph neural networks and local filters.
So,
firstly,
I
mean
exactly
the
example.
I
gave
you
like
if
you
have
a
partial
mesh,
so
this
is
given
as
a
mesh,
but
it's
just
produced
from
one
viewpoint,
so
you're
not
saying
later
and
the
colors
here
are
just
the
texture.
So
these
should
match
some
global.
Some
complete
model,
yeah
I,
think
I'll
skip
yeah.
This
is
an
example
showing
one
you
know
concrete
application,
but
maybe
I'll
just
you
know,
spend
a
minute
on
this
slide.
A
This
is
like
one
concrete
example
of
the
usage
of
a
graph
neural
network,
so
just
to
understand
the
setting
of
the
problem.
The
setting
is
that
you're
given
partially
input.
In
this
case,
you
know
we
remove
the
limbs
of
a
person
and
we
wanted
to
complete
them
right
and
there
are
many
plausible
way
in
which
they
could
be
completed.
But
let's
yeah
forget
about
that
for
a
while,
just
just
think
of
a
partial
graph
and
you
wanna,
and
you
want
to
learn
how
to
encode
that
into
some
kind
of
a
representation.
A
So
here
we
use
the
technique
that
actually
uses
this.
The
graph
is
the
connectivity
vertices,
and
this
is
like
very
similar
to
actually,
if
you
think,
of
a
2d
convolution,
but
instead
of
you
know
a
now,
you're
not
allowed
to
place
like
one
filter
on
top
of
the
Imogen
and
ask
the
the
order
to
stay
constant.
Instead,
you
can
think
of
take
take
each
of
those
filters
along
the
its
dimensions
and
just
perform
its
computation.
It's a little bit hand-wavy, but I'm just saying that in terms of computational complexity it's the same as what you'd have with images; if you transpose the dimensions, the computation becomes a little bit different, but you get this thing to look exactly like a graph convolution. This gives you the output at each new vertex.

So again, you take the filters. The important bit, the main difference from images, is that now you can't commit to a location, so you just take each one; think of it like a one-by-one convolution, that's the best analogy, and that one-by-one convolution does its operation on all of your neighboring points, and then you aggregate. This way you're able to build from whatever input you have; you stack those layers together to get a code, and that way you can build an autoencoder, and that was the idea.
Yeah, let me get just to the point here. You asked earlier about what happens if you're not given the edges, and I wanted to talk about that anyway, so I'm using you as a surrogate. Until now, the examples I gave had the assumption that the edges are given; it was up to the user to say how to connect two vertices, either by spatial proximity or by some other features. But this work, which I think is very interesting, again touches 3D but says something a bit different. It basically says: I have a bunch of points, and we already saw that point cloud networks can embed each point with some high-dimensional feature by looking at some aggregation of a local neighborhood. So what if I now invent edges, and I invent them in a way that measures the difference between the features?
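A sketch of that edge construction, in the spirit of dynamic graph approaches: connect each point to its k nearest neighbors in feature space, and rebuild the graph as the features evolve. The value of k is a placeholder.

```python
import torch

def knn_edges(feats, k=8):
    """Connect each of N points to its k nearest neighbors in feature space."""
    d2 = torch.cdist(feats, feats) ** 2                  # (N, N) distances
    # Drop column 0 of topk: each point's nearest neighbor is itself.
    nbr = d2.topk(k + 1, largest=False).indices[:, 1:]   # (N, k)
    src = torch.arange(feats.size(0)).repeat_interleave(k)
    return torch.stack([src, nbr.reshape(-1)])           # (2, N*k) edge list

# The graph can be recomputed after every layer as the features change.
```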
Another option that I haven't discussed, but a very interesting extension that's been explored in recent efforts, is edge features. If you give me edges in the graph, then I can only say that edges either exist or don't exist; but if I want them to hold more information, maybe I also want to update the information they carry. So imagine that you give me a graph, but it has some errors in it.

Say it's some kind of citation network or social graph, and you accidentally connected two people; that edge shouldn't be there. How can you fix that? One option, which we proposed, though recently there are many other techniques, is to work on the line graph, or the dual graph as it's called, where each edge becomes a node and each node becomes an edge, using the same techniques of graph neural networks.

You can imagine each layer as going back and forth between primal and dual: you first use the dual to update the edge features, and then you use the updated edge features back in the primal graph and do another step of graph convolution. I think if we ever tested it on unweighted examples... I think that eventually, even if it's not weighted in the first iteration, it will become weighted in the second iteration, so you have to take that weight into account.
A
But
why
is
that
different?
Then?
All
right,
I,
don't
think
I
have
enough
time
to
kind
of
open
up
a
new
sort
of
representation.
But
let
me
just
so
you'd
know
it
exists
right.
So
we've
seen
box
upgrades,
we've
seen
point
tiles.
We've
seen
meshes
recently
like
in
past
one
year,
roughly,
there's
been
a
lot
of
interest
in
in
this
implicit
surface
representation
and
that
the
finger
I'm
excited
about
this
topic,
especially
because
I
feel
like
it's
very
natural,
to
ask
a
network
to
represent
this
implicitly.
A
A
So
some
nice
properties
that
we
get,
for
example,
the
fact
that
the
that
they'll,
naturally
using
a
stochastic
gradient
descent,
will
find
a
regularized
smooth
decision
boundary,
for
example,
hints
that
if,
if
I
represent,
if
I'm
trying
to
teach
my
network
about
the
existence
of
a
surface
by
giving
it
points
off
the
surface
and
on
and
on
the
surface,
and
let's
say
that
the
input
I
give
it
is
a
little
bit
noisy.
We
can
still
expect
the
network
to
behave
the
same
way
as
you.
If
you
would
feed.
A
Let's
say
2%
error
into
a
classification
effort
right,
take
em
nest,
take
2%
of
your
data,
replace
its
its
label,
the
network,
who
will
learn
to
to
cope
with
that,
so
something
similar
will
happen
here
as
a
recent
papers
show
which
gives
the
network's.
So
this
rigor
rigor
natural
regularization
that
happens
in
networks.
Give
you
this
nice
smooth,
diffusion,
boundary
representation
of
the
surface,
so
I
think
this
is
like
a
very
promising
direction.
A
Actually,
in
order
to
go
from
that
to
having
something
that
you
can
display,
you
have
to
use
outside
kind
of
procedures
like
marching,
cubes
or
stuff
like
that
that
take
occupancy
and
perform
transform
them
back
to
two
meshes,
so
you
can
display
them,
but
it
also
gives
a
good
opportunity
to
like
use
this
universality
approximation
theorem
of
networks
that
they
can
essentially
represent
any
function.
You
want
so
one
example
that
maybe
is
a
good
go-to.
If
you're
interested
in
that
in
that
topic,
is
there
isn't
a
deep
SDF
paper
you
can?
A
Think
of
this
is
just
you
know:
you're
learning,
to
overfit
to
one
specific
shape,
so
the
network
doesn't
see
a
collection
of
shape,
or
one
version
of
that
paper
is
that,
where
the
network
is
just
overfitting
on
a
simple
on
a
single
shape
and
what
you're
learning
is
a
code?
Is
they
call
it
an
o2
decoder
right?
A
So
you
learn
a
code
and
when
you
introduce
an
XYZ
point
that
you
could
think
of
like
having
as
like
a
cube
of
X,
Y,
Z's
you're,
just
telling
the
network
okay,
please
sample
whatever
implicit
representation
of
the
surface.
You
have
please
sample
it
at
location,
X,
Y,
Z.
So
from
the
existing
XYZ
point
you
have
or
the
dice
the
distance
functions
of
them
to
the
surface.
You
can
teach
that
network,
but
later
on,
you
can
potentially
sample
an
infinitely
infinitely
dense
resolution.
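A minimal sketch of a coordinate network in that spirit: an MLP mapping a latent code plus an XYZ query location to a signed distance value. This is a simplification with made-up sizes, not the exact DeepSDF architecture.

```python
import torch
import torch.nn as nn

class CoordSDF(nn.Module):
    """MLP mapping (latent code, xyz) -> signed distance to the surface."""
    def __init__(self, code_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(code_dim + 3, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 1),
        )

    def forward(self, code, xyz):            # code: (D,), xyz: (N, 3)
        tiled = code.expand(xyz.size(0), -1)
        return self.net(torch.cat([tiled, xyz], dim=1)).squeeze(-1)

# After training, query on an arbitrarily dense set of locations and
# extract the zero level set (e.g. with marching cubes) as a mesh.
model = CoordSDF()
z = torch.randn(256)
queries = torch.rand(100_000, 3) * 2 - 1   # dense samples in [-1, 1]^3
sdf_values = model(z, queries)
```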
And if the network has learned something meaningful, then you should get, for free, a high-resolution version of whatever you started with, and it shows very promising results. I think this is a very cool direction to pursue. There are some interpolation results that they show... yeah, I think I'll finish here.