From YouTube: Week 11 - Qualitative Choices in Representations for Molecules, Materials, and Surfaces - Ulissi
Description
Speaker: Zachary Ulissi, CMU
More about this lecture: https://dl4sci-school.lbl.gov/zachary-ulissi
The Deep Learning for Science School: https://dl4sci-school.lbl.gov/
We are saving the best for last. We're very pleased to have Zach Ulissi with us today. Zach is an assistant professor of chemical engineering at Carnegie Mellon. He works on the development and application of high-throughput computational methods in catalysis, machine learning models to predict their properties, and active learning methods to guide these systems. Applications include energy materials, CO2 utilization, fuel cell development, and additive manufacturing. He's been a part of our community, the national lab HPC communities, for a while: he did his PhD as a DOE CSGF fellow at MIT. Today Zach is going to talk to us about representations for molecules, materials, and surfaces, a topic that I think is of interest to a lot of us doing science here. So with that, Zach, I think you can take it away. Thank you.
I was excited when Mustafa asked if I wanted to give a talk on representations, because this is something that has been moving very quickly, and I found myself having trouble keeping track of what was going on in the field and what different people were doing. So I'm really glad this was the right impetus for me to sit down, organize things, and think about what everyone is working on.
I don't think I'm going to go into quite as much detail as she did on exactly how the math of those transformations works, but they're very much aligned, and the work she's talking about is really where I think a lot of the representations are heading; there's a lot of progress in the area. So I'll go over some of the things she talked about very briefly, maybe a little closer to the end. But if you find yourself really asking what the ideal representation should be, or where this field is going in five years, I would go back and check her talk as a refresher.
One of the reasons why I think this area is so exciting is that so much is changing so quickly among these related problems. In this talk I want to cover three main classes of materials and molecules. The first is small molecules; that's where a lot of things are being done for the first time. The second is inorganic materials, which are really important for a lot of energy applications and have their own set of challenges that I'll talk about. And the third, shown in the little picture in the upper right, is catalysis and surface science, which is very related to the first two but in a sense combines both sets of challenges; I'll talk about that in a little bit.
On this slide I did my best to try and reproduce what I've seen in the literature; I'm sorry if I missed something specific. On the x-axis I plotted roughly when there seemed to be a burst of papers, or when the seminal paper for the area and type of material seemed to have been published. On the y-axis I have a qualitative breakdown of the different types of representations that have been used.
Some of these bins are a little blurry. For example, as I'll discuss, atomic environments and graphs sort of overlap, and it's not always super clear which bin a method should fall under, but there are some differences. One thing that I find very interesting is that if you look at the x-axis, with the exception of small-molecule fragment models, almost all of these fall in roughly 2005 to 2020, and many have really appeared in just the past five years. I don't think I've ever put together a talk with as many links to arXiv as this one, just because so many things are moving so quickly and there are so many papers and so many things going on. So it's very exciting.
At the same time, it's a little bit overwhelming, because when we sit down with our system of interest, whatever we're trying to do, we have to make a choice about where we're going to start and what class of materials we're going to go for. So, for example, if I have a student join my group and we're talking about how to try and solve some problem, I might give some idea of which of these qualitative classes to use. But the high-level perspective of what is really the best, and how things are moving, has been very, very difficult to keep track of, and it can make a really large difference in the sort of behavior that you get out. I think, as Tess showed, in some cases very subtle changes in the models or the representations can have very large impacts on the sort of behavior you see.
Okay, so today what I want to do is first talk about why there are different challenges in small molecules, materials, and surfaces. Each of those subfields has its own issues and questions, and the way they think about the world impacts the way they represent things. If we don't understand where they're coming from or what challenges they're trying to solve, we're not going to be able to get the context for why they're using different types of models.
Then I'll go through each of the different qualitative representation classes that I just showed on the previous slide, and at the very end, very briefly, I'll talk about some recent work on how to automatically try a lot of these representations and find the one that works best for your system; like most AutoML things, it works sometimes but is not always the best. And then a little bit of future outlook: where I think there is progress, and where I think there needs to be some development in order to enable some new areas.
Okay, I want to start with small molecules, because that's really where a lot of these representations are coming from. The systems here are mostly simple hydrocarbons or oxygenates. These are small molecules: by that I mean usually fewer than 20 heavy atoms, where a heavy atom is anything that's not a hydrogen.
Usually it's carbon, nitrogen, or oxygen; sometimes you also have sulfur in there. The space is pretty overwhelming: there was a really well-known work from about 2010 that brute-force enumerated every small molecule under 17 heavy atoms, and with that brute-force enumeration you're already at 160 billion possible molecules. That's wild, and 17 isn't even as high as you could go.
It would be very easy to add more things onto the molecule in the upper right, so this is really a combinatorial space. If you add in any sort of metal center, or you start talking about, for example, sulfur (I don't think that was in the original enumeration), it gets very, very complicated very quickly.
At the same time, the number of elements considered with these small molecules is very small. Like I said, it's really focused on C, N, O, S, and H, and sometimes fluorine, because fluorinated compounds are pretty popular. So there are really only four or five or six elements, depending on whether or not you count the hydrogen in your representation, and that changes the way people think about these models.
B
It
is
okay
to
learn
why
nitrogen
is
different
from
carbon,
simply
through
brute
force
of
showing
in
enough
situations
where
nitrogen
is
different
from
carbon
in
the
inorganic
material
spaces.
I'll
talk
about,
there's
many
different
elements,
and
so
that
changes
the
way
that
you
have
to
think
about
the
problem
a
little
bit.
Another thing to keep in mind is that in the small-molecule space, the computational chemistry methods are really well developed. There's a small number of atoms and a small number of electrons, which means you can use very accurate computational chemistry methods and they don't take a huge amount of time. We can run DFT on a molecule like this and it might take on the order of minutes.
B
There
are
people
who
do
really
top
level
calculations
like
quantum
monte
carlo,
where
you,
you
basically
get
the
exact
answer,
and
it
is
possible
to
basically
go
to
that
level
and
say
what
is
what
is
the
exact
answer?
So
there
are
a
lot
of
really
large
data
sets
that
have
taken
advantage
of
this,
the
scalability
and
this
relatively
fast
compute,
and
that
means
that
the
data
sets
here
are
very
large.
They've been around for a while; people have been doing high-throughput calculations for small molecules for a long time. There are many different self-consistent databases available. Most of them are more than 100,000 molecules, and there are already some with more than 100 million, which is just wild if you think about 100 million DFT calculations.
At the same time, there are some things that make this problem really hard, specifically the fact that people really care about entropy and fluctuations, which are difficult to capture with simple DFT calculations. The other problem is that for a lot of biological systems, because things are so targeted in this area, very small energy differences are important, maybe on the order of 1 kcal/mol. This is much more stringent than in a lot of inorganic materials.
So there's a lot of data and the methods are good, but at the same time you also have to be really accurate for people to trust your results. The obvious application areas for these small molecules, the ones driving most of the materials discovery efforts, are things like biopharma, polymer design and synthesis, and organic photovoltaics.
So I mentioned that a group had already brute-forced the enumeration of small molecules, and this dataset is freely available. It's called the GDB-17 dataset. It doesn't have energies or forces from DFT; it is just an enumeration. This was in 2013.
For example, the QM7 dataset is the GDB-17 dataset after you select only molecules with up to seven heavy atoms. The QM9 dataset is much larger, but it's the same idea: you just select everything up to nine heavy atoms instead of seven. And there's a whole host of small-molecule datasets that are all based on this original GDB-17 enumeration. It's nice that someone already did the hard work of saying what is available.
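As a toy illustration of that kind of subsetting (the molecules and formula dicts below are made up for the example; this is not the actual GDB tooling), carving out a QM7- or QM9-style slice just means keeping entries whose heavy-atom count is at or below a threshold:

```python
# Toy sketch: filter an enumerated molecule list by heavy-atom count,
# the way QM7/QM9-style subsets are carved out of a larger enumeration.
# The molecules and formula dicts here are illustrative placeholders.

def heavy_atom_count(formula):
    """Count all atoms except hydrogen."""
    return sum(n for elem, n in formula.items() if elem != "H")

def select_subset(molecules, max_heavy_atoms):
    """Keep molecules with at most `max_heavy_atoms` non-hydrogen atoms."""
    return {name: f for name, f in molecules.items()
            if heavy_atom_count(f) <= max_heavy_atoms}

molecules = {
    "methane": {"C": 1, "H": 4},
    "ethanol": {"C": 2, "O": 1, "H": 6},
    "octanol": {"C": 8, "O": 1, "H": 18},  # 9 heavy atoms
    "decane":  {"C": 10, "H": 22},         # 10 heavy atoms
}

qm7_like = select_subset(molecules, 7)  # methane, ethanol
qm9_like = select_subset(molecules, 9)  # adds octanol, still excludes decane
print(sorted(qm7_like), sorted(qm9_like))
```

The real datasets are built from SMILES enumerations rather than formula dicts, but the filtering logic is the same.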
And if you look at this, you can see that there's actually a rotation that's possible: I'm going to swing either the right or the left functional group around, and depending on how I do that I'm going to get different energies; there are going to be different local minima, and there's also going to be a barrier for going between them. So the one on the left is different from the one on the right, because the CH3 groups are tilted by 60 degrees.
The other thing we see is that the difference between local minima is very small: noticeable, but small. If I have the CH3 groups across from each other versus right next to each other, there's a difference of about four kilojoules per mole. This is a relatively small energy difference, and if we want to be able to capture it, we need a method that is accurate at the four-kilojoule-per-mole level.
These conformations are really, really complicated to capture, and there are a lot of these degrees of freedom. This is a very simple case; in a lot of larger molecules there are many, many different bonds you can rotate around to get different, distinct local minima. That makes this problem really hard.
Okay, so that's small molecules. The next thing I want to talk about is the work that's been done on the materials science side, really driven by efforts like the Materials Genome Initiative. These materials come from a very large set of possible spaces, and there are databases of experimental structures already available.
The computational datasets have gotten much more popular recently; there are on the order of one to five million enumerated structures out there. But this is really just a very small subset of the possibilities.
No one can go and generate every possible crystal structure with every possible elemental composition, because it's just too large; it's a combinatorially large space. So there is no amount of work we can do to make something like the GDB-17 for all possible crystal structures. What we can do is look at one of these large databases and select from there, and that's usually a good starting point, but it makes things difficult.
The computational methods are fairly well established. Most of these crystal structures are pretty small, so that's good: DFT works fairly well for most of them. There are well-known issues when you go to things like large crystal structures, or you want to incorporate disorder or entropy, or with some very specific classes of materials like oxides, where DFT does not do a good job of describing the behavior. But for most of these crystal structures, the methods are fairly well established.
The datasets are getting pretty good; there are already several large computational databases on the order of a hundred thousand, maybe a small number of millions, in size. So that's excellent.
You can get really powerful models at that size. I would say the key challenges here are, first, that there are a lot of properties we want to consider: in addition to the normal things like stability, we also want to capture things like mechanical properties, which are somewhat more complicated calculations, or thermal or electronic properties.
Those can be a little bit tricky and require different levels of theory. Periodic boundary conditions are also really important, and I'm going to talk about why that's an issue. It's a little bit silly, but it is a way to distinguish what's being done in this community from what's being done with small molecules, and it often limits the representation that we use.
The accuracy requirements are a little less stringent than for small molecules. Typically we're happy with something on the order of 50 meV per atom, so that's a little more forgiving than the small-molecule case. The driving applications, the reason people are pushing these efforts, are really energy materials: photovoltaics, thermoelectrics, batteries. Most of the big screening efforts are in those spaces. So I want to start with periodic boundary conditions, for those who aren't super familiar.
If the cutoff radius is less than half of the unit cell width, the atom cannot see itself, but if the cutoff gets large enough, it is possible for it to see itself, and that becomes an issue; it's something we have to think about, and we have to make sure our representation captures it. Most of the cutoff radii I'll talk about today are on the order of four to ten angstroms, usually between four and six. So usually this is fairly local, and usually the unit cells are a little bit larger than that.
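As a minimal sketch of that self-image concern (a hypothetical helper, not from any particular simulation package), you can count how many periodic images of an atom fall within its own cutoff sphere in a cubic cell:

```python
import itertools

def self_image_count(cell_length, cutoff):
    """Count periodic images of an atom within `cutoff` of itself in a
    cubic cell of side `cell_length` (the atom itself is excluded)."""
    count = 0
    # Scan enough neighboring cells to cover the cutoff sphere.
    n = int(cutoff // cell_length) + 1
    for i, j, k in itertools.product(range(-n, n + 1), repeat=3):
        if (i, j, k) == (0, 0, 0):
            continue  # skip the atom itself
        dist = cell_length * (i * i + j * j + k * k) ** 0.5
        if dist <= cutoff:
            count += 1
    return count

# A 6 A cutoff in a 10 A cell: the atom cannot see its own image.
print(self_image_count(10.0, 6.0))  # 0
# The same cutoff in a 5 A cell: the six face-neighbor images are visible.
print(self_image_count(5.0, 6.0))   # 6
```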
The reason I bring this up is that it's really easy to describe these periodic boundary conditions. It's easy to think about, and when you run your DFT calculation it will take care of all this stuff for you. But it is also really, really easy to make mistakes, and it's easy for this operation to be slow. In my own research group, and among experimental or theoretical collaborators in related areas,
we spend an embarrassing amount of time worrying about our PBC implementations, even though the idea is relatively simple. So if you are taking a small-molecule code and applying it to materials, this is something you're going to have to worry about; it's probably the first thing you'll have to deal with. Another thing that comes up is a modeling question of what you do after you apply your periodic boundary conditions. On the left here, I've shown a really simple crystal structure.
Let's say all the atoms are exactly the same, but inside of the unit cell there is one central red atom and then another blue atom that is being repeated around, so there are basically two different atoms in this representation. This is like going to the Materials Project and asking for the cubic cell representation.
It would give you something that has two atoms, even though they're all the same atom. And if I draw a little cutoff radius of four angstroms, larger than the bond distance of about three and a half, what I see is that this red atom could be considered a neighbor of all four of these blue atoms.
Okay, if I just make arguments about how the red atom is covalently bound, I would say it probably has four bonds. There is then a modeling question, or a representation question, about what to do with this bonding information. I could reduce this down and say every atom is bonded with itself four times through different periodic images, since it is the same red atom being repeated over and over again. A common assumption in some of the graph representations that I'll talk about is that you just pick the minimum-image-convention nearest neighbor: for each atom type, I look at the images of the second atom type, choose the one that is closest, and that's the one that goes into the representation. For the same system, if I don't reduce for symmetry, I label the red and blue atoms as two types of atoms, even though they are essentially the same under symmetry, and in that case I have sort of the same question. I could say every red atom is bonded with four blue atoms and vice versa (if I look at a blue atom, there are four red atoms around it), or, if I just use the nearest-image convention, I could say every red atom is bound to one blue atom.
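The minimum-image distance itself can be sketched for an orthorhombic cell like this (a toy helper, assuming the cutoff is small enough that wrapping to the single nearest image is valid):

```python
def minimum_image_distance(pos_a, pos_b, cell):
    """Distance between two atoms under the minimum-image convention
    in an orthorhombic cell; `cell` holds the three box lengths."""
    d2 = 0.0
    for a, b, length in zip(pos_a, pos_b, cell):
        d = b - a
        d -= length * round(d / length)  # wrap to the nearest periodic image
        d2 += d * d
    return d2 ** 0.5

# Two atoms near opposite faces of a 4 A cubic cell: the nearest
# images are only ~0.2 A apart, not 3.8 A.
cell = (4.0, 4.0, 4.0)
print(minimum_image_distance((0.1, 0.0, 0.0), (3.9, 0.0, 0.0), cell))  # ~0.2
```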
That last choice is the one that wouldn't change after you repeat the cell under periodic boundary conditions. This issue comes up a lot when you read papers on the graph convolution methods I'll talk about; it's a common modeling question that partially explains why some models work better or worse than others, and it's worth keeping in mind. Again, it's very simple, but it is a real logistical challenge when dealing with these representations.
Okay, for inorganic materials there are a couple of large datasets. I chose just two, based on ones that I am particularly familiar with. The first is AFLOWlib, run out of Duke University, and the second is the Materials Project, run out of LBL and Berkeley by Kristin Persson and others. Both of them, I would say, apply sort of the same way of thinking about things, and the calculations are fairly similar. AFLOWlib tends to have more enumerations of the same types of structures, while the Materials Project is usually a little more driven by what might be experimentally relevant, but both are very similar ways of thinking about things.
I highlighted the Materials Project just because, when you read papers on representations in materials science, most of them use the Materials Project dataset as a benchmark. That's the one people have chosen to use, not for any particular scientific reason, just because it's easy to download from and people are familiar with it.
Okay, and finally I want to talk about why surface science and catalysis, which is really where I spend most of my time in my research group, is so complicated, and why it has been really challenging for me to think about these representations. The possible space of materials and configurations that I need to think about is really overwhelming. The problem is that I basically take all of the diversity of the small molecules I just talked about: there are 160 billion small organic molecules that I can put on a surface.
The accuracy of the computational methods is also a limitation. For example, when you start to consider an extended periodic system for a surface, the number of atoms usually goes up: it's common to do 20 to 100 atoms, which is a little larger than inorganic materials usually are. That makes DFT reasonable but a little slow, and there are not that many experimental benchmark methods or datasets.
Charlie Campbell at the University of Washington is really the leader in those efforts. Because there are so few numbers that we really, 100 percent, absolutely know, there are a lot of different competing methods: you'll read papers in this area where people use PBE or RPBE or van der Waals functionals or hybrid methods or RPA, and it is really hard to say exactly what the right answer is, besides that it probably gets more accurate as you go up the chain to hybrids and RPA.
Disorder and large nanoparticles are both common things we want to think about in catalysis, and we often consider oxides, so all the problems with oxides I mentioned for materials also show up here. The datasets are really small compared to materials and small molecules, but they're growing; all the ones I'm aware of are less than 100,000 structures, and most of those have been published in the past year or two.
So we have all of the diversity of the first two areas, but our datasets are orders of magnitude smaller, which is a problem. Common challenges: there are a lot of reactions we need to consider; the accuracy requirement is not super stringent, usually plus or minus 0.1 eV is okay; and, like I said, the driving applications are energy materials, but on top of that there are applications to manufacturing, fuel cells, and batteries.
So let's go into a little more detail. For one of these surfaces, I'm thinking about all of the possible intermediates I could have in a possible reaction pathway. This is a paper I worked on when I was a postdoc a few years ago. We were looking at a relatively simple system of CO and hydrogen in the gas phase reacting to selectively make one of a number of possible products, which could be ethanol, methane, acetaldehyde, methanol, water, or CO2.
Ideally, we want to make something valuable like acetaldehyde or methanol, and we don't want to burn it to CO2 and water. Even for the simple rhodium system, one metal, a flat surface, no complexity, there are thousands of possible pathways that I could write down, and finding just the reduced pathway on the right means you have to consider all the possible intermediates and all the possible reactions. That gets really, really time-consuming.
For any individual intermediate, I have to do a series of calculations where I watch these adsorbates move around and find the most stable configuration. This is an example of an OH on a nickel-gallium surface from some unpublished work. I guessed it should be on a nickel site, and it looks like it sort of moves over to a nickel-gallium bridge, so it's actually moving sites and changing configurations; it's very dynamic.
Okay, the same idea of small molecules on inorganic materials comes up over and over. It's not just thermal catalysis; it's also CO2 utilization, water splitting, hydrogen storage, selective catalysis, water desalination and remediation, polymer-metal interfaces, corrosion resistance. All of these are basically the same fundamental question of how small molecules interact with inorganic surfaces, and again it's the same hard problem that shows up everywhere.
Okay, so with those ideas in mind, we can start to think about how we might compare different types of representations for the system that we care about. I want to start with small molecules, because again, that's the most established area and there's been a lot of work there. On the left is a picture from some work on developing machine learning models for how similar two different molecules might be to each other, using the QM9 data. This was the first paper to really establish what QM9 was.
They were the ones who did those calculations, and that was only 2012, only eight years ago. On the right is sort of a high-level review and benchmark paper called MoleculeNet, from Vijay Pande's group at Stanford. What they did was compare a lot of different methods, and it's very similar to what you would see in any other area of machine learning: we're interested in how the accuracy changes as a function of the training set size. So this is a simple learning curve; this comes up over and over. That's great.
So I'm going to take exactly the same learning curve and re-plot it on a log-log scale: on the x-axis it'll be the log of the training set size, and on the y-axis the log of the MAE. What we see, and I'll show curves for a lot of different methods on QM9 in a second, is that a lot of these methods are very linear.
Surprisingly linear in this log space, in fact. The slope of a method says something about its effective dimensionality. We want this curve to be as low as possible, so shift the whole thing down, and we also want it to be as steep as possible, which means adding more data helps quickly. We can often get some insight into what's going on with these systems by looking for features: for example, a plateau in this space usually implies that there are multiple data points that all have the same representation. So that's something we can see.
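As a sketch of that kind of analysis (the numbers here are synthetic, not from any of the papers mentioned), the learning-curve slope is just a least-squares fit of log(MAE) against log(training set size):

```python
import math

def learning_curve_slope(train_sizes, maes):
    """Least-squares slope of log10(MAE) vs log10(N)."""
    xs = [math.log10(n) for n in train_sizes]
    ys = [math.log10(e) for e in maes]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Synthetic model whose error falls off as N^(-0.5):
sizes = [100, 500, 1000, 5000, 10000]
maes = [2.0 * n ** -0.5 for n in sizes]
print(learning_curve_slope(sizes, maes))  # ~ -0.5
```

A steeper (more negative) slope means the method improves faster as data is added, which is the comparison being made between representation classes here.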
This sort of analysis assumes you have a uniformly sampled dataset; if you have biased data, you're going to get different curves. So this isn't something you can attribute purely to the model; it depends on the data as well. But it does give us some idea of not just whether one model is more accurate than another at 100,000 data points, but how it is scaling, and whether the representation is actually more powerful or not.
People have proposed a lot of different small-molecule models, and Anatole von Lilienfeld, at the University of Basel and now at the University of Vienna, has really driven a lot of this. He has an awesome presentation on these sorts of ideas for small molecules, with the link at the bottom; this chart is from his slides, and he has a couple of related papers. What we see is that there are a ton of different methods that people have proposed for small molecules, all with different representations.
The curves can shift up and down, and there are also different slopes, so qualitatively, right away, we can see that there are two different classes of methods. One is the set in the upper right, things like Bag of Bonds, which I'll talk about.
Those have a lower slope than a lot of these other methods that are newer, a little more complicated, and include more properties in the feature set. So, qualitatively, the fact that those two slopes are different says that the lower red set is probably more powerful, because it's scaling better. We can shift a whole line up or down by playing around with how we train or how we do the hyperparameter optimization, but it's pretty rare that you actually change the slope by playing around with the parameters.
This is very powerful, because it says that you can do small-dataset training. I can do a hundred, and 500, and a thousand, all of which train fast and all of which I can train on a Google Colab instance or my own laptop or a desktop, and the scaling of that says something about how accurate you might be at a hundred thousand.
That is really powerful, right? It says you don't need to be doing these really crazy trainings in order to develop better methods. That's really cool, and not something that's obvious in all machine learning areas; it seems to be something about these large, diverse, well-sampled, unbiased datasets. Okay, we can do the same thing not just for small molecules; we can also do it for inorganic crystals.
This is a chart that I made for some unpublished collaborative work. On the left are small molecules, basically the same as before: QM9 with two different types of models, the same models I was showing here. On the right is the Materials Project formation energy dataset with two common methods that I'll talk about, including CGCNN. What's interesting is that the same method applied to two different datasets, or two different regimes, yields two different slopes, and so this also says something about the effective dimensionality of the problem on the left.
Okay, with that in mind, I'll start jumping into representations. Actually, I think before I do that, there's a question.
Yeah, just to interrupt: there was one question on those plots that you just showed. Do you want me to read it, or do you want to read it yourself?
I can see it. So the question is: what does the y-axis mean here, the total energy or thermochemistry data? Let me go back; I think you're talking about this one, is that right? Yeah, that was it. Cool. Okay, so the y-axis here is, I believe, the formation energy or the atomization energy of these small molecules; that is basically the energy.
The atomization energy is: I take a small molecule and calculate the energy, and then I keep dilating it, making it bigger and bigger until all the atoms are really spread out and don't see each other. That is sort of the same thing as a cohesive energy, but for small molecules.
Those are the two metrics people usually use with these QM9 datasets. You could just as easily do the same thing for any other property, like polarizability, or how much it likes some solvent, or some electronic property like the band gap or the HOMO-LUMO gap, or whatever else you want. Those are all common, but the formation energy is the one people tend to use when developing these new methods.
Alright, so I'll go ahead and continue, but if I see anything pop up, I'm happy to take some time to chat and discuss a little more. Okay.
So let's start thinking about different ways that we can represent these materials. The first one I want to start with is the simplest, and that is composition features, where we're just looking at the types and numbers of the different elements.
This is especially common in materials science. Experimentally, if you're making something, you might only know the composition, how much of each material went into it; you might not know exactly what the crystal structure is, or the other things we care about for some of the more complicated models. The general idea is that you take the composition on the left, maybe some binary oxide or some other ternary inorganic material, and you enumerate a lot of features for each of the element types.
After the feature combination stage, we have a fixed-length vector that we can play around with. The pros: it's very simple, it's physically motivated, and I only need to know the composition. And because it's so simple, it often works with very small datasets, which is cool. The cons are that this composition obviously doesn't allow me to specify why one polymorph of a binary oxide might be different from another polymorph with the same stoichiometry, so it can't handle polymorphs.
It won't tell me why certain structural features are important, and if you apply it to a very large dataset like the Materials Project formation energies, it tends to perform quite a bit worse than the more complicated representations we're going to talk about later. There are a couple of libraries that people use in this space over and over. The most common one is the set of descriptors called the Magpie descriptors, from Chris Wolverton's group at Northwestern University, published in 2014; Bryce Meredig was one of the students on that paper.
B
Bryce
is
also
the
cto
of
situation
informatics,
which
does
a
lot
of
work
with
similar
problems
in
material
science.
Now,
basically,
they
went
through
and
collected
a
bunch
of
different
properties
and
some
standard
combinations,
so
some
combinations
of
the
types
of
elements
and
the
fractions
elemental
properties,
like
the
mean
absolute
deviation,
minimum
maximum
mode
whatever
for
things
like
the
atomic
number
or
the
radii
or
number
of
electrons
or
a
bunch
of
other
common
ones.
They
have
electronic
structure
attributes
and
they
have
ionic
control
ionic
compound
attributes.
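A minimal sketch of this kind of composition featurization (the property table and the statistics chosen here are illustrative stand-ins, not the actual Magpie set):

```python
# Fraction-weighted statistics of an elemental property give a fixed-length
# feature vector from composition alone. The property table below is an
# illustrative stand-in for the real elemental-property lookup tables.
ATOMIC_NUMBER = {"Ti": 22, "O": 8}

def composition_features(fractions, prop=ATOMIC_NUMBER):
    """fractions: dict mapping element -> molar fraction (sums to 1)."""
    vals = [(prop[el], x) for el, x in fractions.items()]
    mean = sum(v * x for v, x in vals)
    mad = sum(abs(v - mean) * x for v, x in vals)  # weighted mean abs. deviation
    vs = [v for v, _ in vals]
    return {"mean": mean, "mad": mad, "min": min(vs), "max": max(vs),
            "range": max(vs) - min(vs)}

feats = composition_features({"Ti": 1 / 3, "O": 2 / 3})  # TiO2
```

The same statistics would be repeated for each tabulated elemental property and concatenated, which is why the vector stays fixed-length no matter how many elements appear.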
B
There are a lot of implementations now, because these are very common. There's a set of code for Magpie itself; Johannes Hachmann has done the same thing at the University at Buffalo; and Anubhav Jain at LBL has a bunch of these implemented in Automatminer, which I'll talk about a little bit later. But I would say a lot of things really rest on the same set of descriptors at this point.
B
You can take these descriptors and you can find linear or nonlinear combinations that might make even better descriptors, and this is very common in the materials space; Matthias Scheffler and others, like Luca Ghiringhelli, have really been driving this.
B
Basically, you search for some nonlinear combination of these that describes the property you're interested in, and so in this example, they basically found that with these two really complicated descriptors they were able to separate really well what was a metal or a non-metal, and this is cool because you can see the algebraic formulation.
B
And so when you deal with these composition features, one of the really common things that you do to improve the representation is run it through a code like SISSO, which tries to find the best nonlinear combination to help you with your problem. This is really common, so that's why I bring it up.
B
The classic paper is by Benson when he was at USC, and it's typically referred to as Benson group additivity. The idea is basically that I am going to represent the energy of this small molecule, maybe the formation energy or something else, by adding up all the contributions from different subsets.
B
This is motivated by looking at something like an alkane chain. On the bottom left, these are formation energies, heats of formation, for alkanes of different lengths, and what we see is that as we add more and more CH2 groups to the middle, the extra heat of formation always changes by about 4.9 kcal/mol.
B
The way that I apply this to a larger molecule is that I go through and find all of the unique types. So there is one methyl group with a carbon, so that's labeled one, or rather there are two of those, one on each side; there is one carbon that has two methyl groups and another carbon nearby; and I'm going to represent the total energy as a linear combination of each of those independent fragments.
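A group-additivity estimate is just a dot product between fragment counts and tabulated group contributions. A minimal sketch (the group values here are rough, illustrative numbers in kcal/mol, not authoritative Benson values):

```python
# Total estimate = sum over fragment types of (count * group contribution).
# The values below are illustrative approximations, not the real Benson table.
GROUP_VALUES = {"C-(C)(H)3": -10.2, "C-(C)2(H)2": -4.9}  # kcal/mol

def group_additivity(group_counts, values=GROUP_VALUES):
    return sum(values[g] * n for g, n in group_counts.items())

# n-butane: two terminal CH3 groups plus two interior CH2 groups
hf_butane = group_additivity({"C-(C)(H)3": 2, "C-(C)2(H)2": 2})
# n-pentane: one more CH2, so the estimate shifts by one CH2 increment
hf_pentane = group_additivity({"C-(C)(H)3": 2, "C-(C)2(H)2": 3})
```

Each extra CH2 group changes the estimate by exactly the same increment, which is the linear trend in the alkane plot.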
B
If I increase the size of the radius, I will get larger and larger fragments. Each of these is now a fundamental thing that I'm going to build my model off of. I could build a linear model, I could also build a nonlinear model, and one common way to take all these different fragments and turn them into a representation is simply to count how often each one appears.
B
The pros of this are that it's simple and physical. The downside of this approach is that it scales very poorly with the number of elements or number of fragments: every fragment is considered different here. There's no intrinsic idea of why some of these alkane-like fragments are similar to one another; they're all considered completely independent, and so there's no way of combining those things together. If you see a new fragment that you've never seen before in a hypothetical molecule, you're sort of stuck, and that makes it challenging.
B
Okay, these approaches work really well on small molecules, and if you want to try out these things, my suggestion is to use a tool like RDKit, which is open source and really helpful, and will do all sorts of different simple fragment-based fingerprinting methods. I wouldn't write this sort of thing from scratch; there are already really, really good methods for doing this.
B
Okay, let's take a second before moving on. I see another question on slide 20: there was MAE versus training data set size, and you talked about good ML and bad ML.
B
Okay, so let's take a step back and then I'll come back to this one. So that's slide 29. Okay, so let's go back to the good ML and bad ML. The problem in this case,
B
the reason why it's saturating, is that if you have two molecules that have the same representation, and one has, for example, a formation energy of 5 kilojoules per mole and the other has a formation energy of 200 kilojoules per mole, but they have exactly the same representation, then no matter how good your machine learning model is (neural network, Gaussian process, kernel method, whatever you want), it cannot distinguish between those two. It has exactly the same representation, but it's labeled in two different ways.
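You can see this floor directly: for a repeated input with conflicting labels, the squared-error-optimal deterministic model can only predict the mean of the labels. A tiny illustration:

```python
# Two samples with identical representations but different labels: under
# squared error, the best any deterministic model can output for that
# representation is the mean of the conflicting labels.
def best_prediction(labels):
    return sum(labels) / len(labels)

labels = [5.0, 200.0]                     # kJ/mol, same representation
pred = best_prediction(labels)
errors = [abs(y - pred) for y in labels]  # irreducible error on both
```

No amount of extra data with the same representation shrinks those residuals, which is exactly the saturation in the learning curve.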
B
So the best that you can do is guess halfway in between, and you're bad at both. That is usually one of the drivers for this sort of saturation behavior, and it's basically a sign that your representation is not rich enough to distinguish between different things that have different properties.
B
The same thing comes up, and I think this is a good question, with composition descriptors. For example, I could have a lot of compositions that have very, very different energies, and from a representation like this I would have no way of distinguishing them. The model would have to give all of them the same energy, and if I'm trying to label a bunch of polymorphs and they all have different energies but the same composition, I'm stuck right away. My model is not going to get any better after
B
I add some more data, because I cannot distinguish between the things that I've already seen. This comes up a lot. It's a good way of diagnosing either issues with the data set or issues with the representation. The same problem can come up if you have bad data, where you have two of the same molecule and they're labeled differently; you get a very similar sort of behavior.
B
String representations like SMILES are very powerful because there are already a lot of really good machine learning models that operate on text strings, and so people have spent a lot of time applying natural language processing tools to these sorts of SMILES representations.
B
A second problem is that if I just generate some random string of numbers and letters, so "OOCC" or whatever, it's possible to generate a string that does not decode to a real molecule. That means that if your natural language processing tool is just spitting out strings, some of those might not even be molecules. It's just making up nonsense.
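A toy illustration of why random strings fail (this is a made-up syntactic check, not a real SMILES parser): even trivially necessary conditions, like balanced parentheses and paired ring-closure digits, reject many random strings, and a real decoder enforces far more chemistry on top of that.

```python
# Hypothetical, minimal "could this even be a SMILES string?" check:
# balanced parentheses and every ring-closure digit appearing an even
# number of times. Real decoders (e.g. in RDKit) check much more.
def passes_toy_check(s):
    depth = 0
    digit_counts = {}
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:       # closing a parenthesis that was never opened
                return False
        elif ch.isdigit():
            digit_counts[ch] = digit_counts.get(ch, 0) + 1
    return depth == 0 and all(n % 2 == 0 for n in digit_counts.values())

ok = passes_toy_check("CC(=O)OC1=CC=CC=C1")  # well-formed string
bad = passes_toy_check("C1CC(C")             # dangling ring digit and paren
```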
B
A lot of the progress in small molecules has been driven by better representations, by basically improving the actual grammar of the string itself. So this SELFIES representation, by Alán Aspuru-Guzik's group, is another string-based representation whose grammar enforces some nice properties, such that it always decodes to a small molecule.
B
So no matter what string you generate, you can decode that thing, and it will be something sensible. That's very cool. That means that I can take all the super cool stuff going on in machine learning, like BERT or other NLP models, and I can apply those directly to this area. I can also apply generative text models to small molecules. That's cool!
B
One of the cool things that you can do because of these sorts of representations is that you can come up with variational autoencoders or other generative models, or GANs, and you can generate new small molecules that are basically hypothetical things that you should try and test. This has gotten very hot in the past couple of years. It's really interesting.
B
The next type of representation I want to talk about is a little bit more complicated. Everything up until now, I haven't really talked about bonds or angles or other complicated features, and there is a whole host of methods that try to look at an atom and its nearby neighbors and come up with a representation that describes what's going on locally.
B
One of the most common is something called a high-dimensional neural network potential, or atom-centered symmetry functions, or a Behler-Parrinello machine learning potential; all of those are the same thing. The idea is basically that you take each atom, you look at its neighbors, and you try to come up with a fixed-length representation.
B
You take that representation, and maybe you use it to compare to another structure, or you feed it to some machine learning model and you try to predict the per-atom energies and sum those up. There are some small differences in how that gets done, but the same idea comes up over and over. These Behler-Parrinello or HDNNP potentials have been around since 2007.
B
So, for example, one of the entries in this vector might be for all of the copper-copper bonds. I am going to take all of the radii for those bonds and use each one in this little lookup table with some eta specified. So let's say eta is four and the radius is three: I'm going to look up that value, and it's about a G2 of 0.15 or so, and I will do that for every such copper-copper bond,
B
add them all together within my local cutoff radius, take the sum, and put it in the vector. I can do the same thing for angles. This has to be done for every unique combination, so I'm going to take all of the copper-carbon-copper angles and apply the same lookup-table idea to the theta that I get out of that angle computation, and I'm going to add up all of those that happen near me and shovel those into another section of the vector.
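The radial part of this can be sketched in a few lines. This follows the usual G2 form, G2 = Σ_j exp(-η (r_ij - R_s)²) · f_c(r_ij), with the standard cosine cutoff function; the parameter values are arbitrary choices for illustration:

```python
import math

def cosine_cutoff(r, r_c):
    # smoothly decays to zero at the cutoff radius r_c
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0) if r < r_c else 0.0

def g2(distances, eta, r_s, r_c):
    # one fixed-length feature per (element pair, eta, r_s) choice:
    # sum the Gaussian lookup over all neighbor distances of that pair
    return sum(math.exp(-eta * (r - r_s) ** 2) * cosine_cutoff(r, r_c)
               for r in distances)

# e.g. all Cu-Cu distances around one atom, with assumed parameters
feature = g2([2.5, 2.5, 3.6], eta=4.0, r_s=3.0, r_c=6.0)
```

However many neighbors there are, the sum collapses them into one number, which is what makes the representation fixed-length.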
B
A second thing is that this implicitly assumes that the representation should be local. So if there are long-range interactions, it can be harder to capture those in this sort of model, and that's another active research area that I'll talk about.
B
There are many, many, many such local-environment fingerprints, so there are reviews coming out all the time. This is one from Goedecker just this year: they compared many-body symmetry functions, the FCHL representation from Anatole von Lilienfeld's group, SOAP descriptors, the overlap matrix, and ACSF, which is the one I showed in the previous slide.
B
This is another choice you have to make: for every single one of these there's another choice of what all of these magic numbers should be and how many different types of things you should include, so the problem gets a little bit overwhelming. People have already started to compare accuracy for these different methods.
B
This is a really nice paper from Shyue Ping Ong's group at UCSD, where they basically went through, for a very common materials science problem, and compared different types of descriptors and different types of potentials, both in terms of the computational cost and the error. One interesting thing that came out of this was that these moment tensor potentials, which are relatively new, seemed to do quite well: they were either more accurate or lower computational cost, depending on what you care about, on the upper right graph.
B
The other cool thing is that you can see how all these different methods have different accuracies for these sorts of situations. This neural network potential is the same as the ACSF I was showing before, so
B
those orange points are also pretty good. Before you get started in one of these areas, I would think about what is easy to implement, and I would also think about which benchmark data set is closest to what you're doing, so that you don't have to try each one of these for your system and see which one is most accurate.
B
As I said, one of the downsides of this sort of local approach is that you don't get long-range forces. The most common long-range force is electrostatic interactions: if I have charge on my molecule, a charged atom interacting with another charged atom feels a force that scales as one over r squared, which is really scary. That's a very, very long-range force.
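To see why a plain cutoff is dangerous here, compare how much interaction energy lives beyond the cutoff for a 1/r pair energy versus a short-ranged one. This toy uses a 1D chain of neighbors with made-up prefactors, purely to show the scaling:

```python
import math

# Fraction of the total pair energy that lies beyond the cutoff, for a toy
# 1D chain of neighbors at spacing 1, truncating the lattice sum at `far`.
def tail_fraction(pair_energy, cutoff=8, far=4000):
    total = sum(pair_energy(float(n)) for n in range(1, far))
    tail = sum(pair_energy(float(n)) for n in range(cutoff, far))
    return tail / total

coulomb_tail = tail_fraction(lambda r: 1.0 / r)              # most of it!
screened_tail = tail_fraction(lambda r: math.exp(-2.0 * r))  # negligible
```

A local model truncated at a few angstroms simply never sees the majority of a Coulomb-like interaction, while it loses essentially nothing of a short-ranged one.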
B
So this is a relatively hot area now: trying to add in electrostatics and long-range forces. This is a paper that came out just a month or two ago from Behler. Basically, they implement two different neural networks with very similar symmetry functions, very similar representations, and they have a first step where they try to predict the electronegativity.
B
Similar ideas have been done for small molecules. For example, there was work by Michele Ceriotti last year, basically looking at other ways of including long-range effects into these sorts of interactions.
B
The point I just want to get across is that if you know you have long-range forces, you need to be aware of it, and it's going to change the way that you represent your system. If you have a charged system and you just take a local method and assume it's going to work because of machine learning, you're probably going to have a bad time.
B
The last thing I want to talk about with these local features is that some people have thought about how to improve the element scaling. So there is a representation called the weighted atom-centered symmetry function, where you basically add an additional weight to these symmetry functions that depends on the atomic number; that allows you to have a fixed-length representation that scales.
B
The performance of that for inorganic materials hasn't been awesome, but it does partially solve the element-scaling problem. Another example is by John Kitchin in my department here at CMU; he basically showed that the same set of weights in the same network could actually be used for all of the different elements.
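This weighting is a small change to a standard G2-style radial sum: neighbors of all elements go into one sum, each scaled by (some function of) its atomic number, so the feature count no longer grows with the number of element pairs. A sketch with arbitrary parameters:

```python
import math

def cosine_cutoff(r, r_c):
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0) if r < r_c else 0.0

# wACSF-style radial feature: neighbors of *any* element contribute to the
# same sum, weighted here simply by atomic number (one possible choice).
def weighted_g2(neighbors, eta, r_s, r_c):
    # neighbors: list of (atomic_number, distance) pairs, elements mixed
    return sum(z * math.exp(-eta * (r - r_s) ** 2) * cosine_cutoff(r, r_c)
               for z, r in neighbors)

# a Cu neighbor (Z=29) and an O neighbor (Z=8) share one feature channel
feature = weighted_g2([(29, 2.5), (8, 1.9)], eta=4.0, r_s=2.0, r_c=6.0)
```

With a plain per-pair ACSF, every element pair needs its own set of G2 channels; here the same channels cover any chemistry, at the price of some lost resolution between elements.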
B
Okay, one thing to keep in mind is that it has gotten really easy to make a neural network potential, and these are just the codes that I know of off the top of my head; I included links for all of them if you're interested, in alphabetical order. Don't rewrite your own neural network potential from scratch unless you're really sure that something is new. This area is moving really quickly.
B
Okay, moving on to something a little bit more complicated and a little bit closer to what Tess was talking about last month. Graph methods got really popular in about 2015, with this paper by Ryan Adams and Alán Aspuru-Guzik, when he was at Harvard. The idea is basically to apply graph convolutional networks to small molecules.
B
So one-zero-zero, or if it's an oxygen, zero-one-zero. There's a whole host of methods that are all related, slightly different methods but the same fundamental idea and the same sorts of inputs. The MoleculeNet paper that I talked about earlier, from Vijay Pande's group, does a really nice job of talking about how they're different and how they're similar.
B
If you're interested, I would definitely read those first. Most of them are based on the idea that you're looking locally and applying convolution operations, or you're passing messages around and then trying to collect the messages and predict final properties.
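One round of the message-passing idea fits in a few lines (sum aggregation with no learned weights; real models interleave this with trainable layers):

```python
# One message-passing step on a toy molecular graph: each node's updated
# feature is its own feature plus the sum of its neighbors' features.
def message_pass(node_feats, edges):
    new = {i: list(f) for i, f in node_feats.items()}
    for i, j in edges:                      # undirected bonds
        for k in range(len(new[i])):
            new[i][k] += node_feats[j][k]
            new[j][k] += node_feats[i][k]
    return new

# water-like graph: O (node 0) bonded to two H atoms (nodes 1 and 2),
# with a one-dimensional feature (here just the atomic number)
out = message_pass({0: [8.0], 1: [1.0], 2: [1.0]}, [(0, 1), (0, 2)])
```

After one step, each node's feature summarizes its first shell; stacking more steps (the "convolutions") grows the receptive field, and a final readout pools the node features into a molecule-level prediction.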
B
The weights could be distances or something else; we could include angles as additional complications; and there are other things that we can put into edge features, like what the distance is or what the properties of the nodes are. Ultimately, this graph representation comes down to a modeling decision. There's not one right way: if you look in the literature, there are all sorts of different ways that people are applying this.
B
So when you read one of these papers, it is not enough to just say whether it's message passing or not for the actual implementation. You also need to think about what the actual graph is that you're operating on, and how you apply that to your system. This gets back to some of the questions about periodic boundary conditions and representations that I was talking about earlier.
B
One thing that you could do is look at your neighborhood and pull everything together: you say, I want to find everything nearby, and everything that is within a certain distance or within a certain shell gets counted as a neighbor. That's the simplest approach, and it's usually the fastest.
B
It works pretty well, but one problem with this approach is that if you have two different element types that have different atomic radii, it's not always super clear what you mean by a covalent bond or what the radius should be.
B
So we can go from an element-specific representation to something a little bit more general using an idea called a Voronoi tessellation, where we basically take the atomic structure and create bounding boxes, where the rule is: I am going to make a line based on the fact that it is equidistant between two points. So the top-left blue line is equidistant between the orange and the black.
B
The left one is equidistant between the black and the red, and the lower left is between the black and the green. The Voronoi representation, then, is that the area, or the length, of the polyhedron face that forms one of those little edges says something about how much interaction there is between those two atoms. So in this case, the fact that there is a very large interaction between the black and the purple is represented by the fact that the line between the black and the purple is very long compared to the line between the black and the red.
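A crude way to see this numerically without a computational-geometry library (in practice you would use an exact tessellation, e.g. scipy's): sample the plane on a grid, assign each sample to its nearest atom, and count adjacent sample pairs with different owners. That count approximates the length of the shared Voronoi facet between each pair of atoms:

```python
# Grid-sampled 2D approximation of Voronoi facet sizes between point "atoms".
def shared_boundaries(points, lo=-1.0, hi=3.0, n=80):
    step = (hi - lo) / n
    def owner(x, y):   # index of the nearest atom
        return min(range(len(points)),
                   key=lambda k: (points[k][0] - x) ** 2 + (points[k][1] - y) ** 2)
    grid = [[owner(lo + i * step, lo + j * step) for j in range(n)]
            for i in range(n)]
    counts = {}        # (atom a, atom b) -> approx. shared boundary size
    for i in range(n - 1):
        for j in range(n - 1):
            for a, b in ((grid[i][j], grid[i + 1][j]),
                         (grid[i][j], grid[i][j + 1])):
                if a != b:
                    key = tuple(sorted((a, b)))
                    counts[key] = counts.get(key, 0) + 1
    return counts

# three collinear atoms: 0 touches 1, and 1 touches 2, but 0 never touches 2
counts = shared_boundaries([(0.0, 0.0), (1.0, 0.0), (2.5, 0.0)])
```

The facet sizes then become the edge weights of the graph, with no element-specific radius to choose.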
B
The downside is that these Voronoi methods tend to be a little bit more expensive; depending on your system and how many of these you have to do, that can get a little bit slow and slow down your methods.
B
Another subtlety is periodic boundary conditions. In this example, the zero atom is bonded with the one atom and is also seeing the one atom one cell over, and so in the adjacency matrix it would not just be a one, which would say that the two atoms are neighbors; it would be a two, in that there are identically two different interactions between those unique atoms in the system. And so this representation, I think, is most rigorously called a mixed multigraph.
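A one-dimensional sketch of that counting (with hypothetical numbers): with a cell of length 3 and a cutoff of 2.5, an atom 1.0 away is seen both directly and through a periodic image, so its adjacency entry is 2:

```python
# Count how many periodic images of atom j fall within atom i's cutoff in a
# 1D cell; this count is the multigraph adjacency entry, not just 0 or 1.
def image_count(x_i, x_j, cell, cutoff, n_images=3):
    return sum(1 for s in range(-n_images, n_images + 1)
               if abs((x_j + s * cell) - x_i) <= cutoff)

# atom j sits at 1.0 (direct) and at -2.0 (image, one cell to the left):
# both are within 2.5 of atom i at the origin
entry = image_count(0.0, 1.0, cell=3.0, cutoff=2.5)
```

In 3D the same loop runs over image shifts of all three lattice vectors, and each distinct image within the cutoff contributes its own edge (with its own distance) to the graph.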
B
This is the one that we're using for most of our representations now. Again, it is a modeling decision for these graph methods. Because these graph methods have done so well, there's been a lot of progress in this area, and there's still a lot of competition to make these things more accurate, so one of the ones I wanted to highlight is one that I found especially impressive.
B
Perfect, okay, thanks. Okay, I haven't talked a lot about materials so far; it's mostly been small molecules. If we want to apply this to materials, then the additional thing that we want to consider is: how do we encode the element type? A breakthrough in this area, I would say, was this paper from Jeff Grossman's group called CGCNN.
B
It's a very simple convolutional method. I would say it's so simple that maybe I shouldn't even call it a convolutional method, a graph convolution method, but the really key insight was to put in node features that were based on the elemental properties, so that you didn't have to learn every unique combination of elements.
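The contrast with one-hot encodings is easy to show. Here node features are looked up from a small table of real elemental properties (atomic number, Pauling electronegativity, covalent radius; values rounded and the table obviously incomplete), so an element absent from training still lands somewhere sensible in feature space:

```python
# Property-based node features: unlike a one-hot code, an "unseen" element
# still gets a meaningful vector, close to chemically similar elements.
ELEMENT_PROPS = {            # (Z, electronegativity, covalent radius / Å)
    "Cu": (29, 1.90, 1.32),
    "O":  (8,  3.44, 0.66),
    "Ag": (47, 1.93, 1.45),  # pretend Ag never appeared in training
}

def node_features(elements):
    return [ELEMENT_PROPS[el] for el in elements]

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

cu, o, ag = node_features(["Cu", "O", "Ag"])
# Ag sits far closer to Cu than to O in property space, so whatever the
# model learned about Cu-like nodes transfers
```

In practice CGCNN bins a longer list of properties per element, but the effect is the same: chemistry, not bare identity, defines the node.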
B
This really kick-started a huge effort; this paper from 2018 has already gotten a huge number of citations.
B
A lot of the new ideas, like DimeNet, SchNet, and others, are getting implemented there, and that has, I think, made things a lot easier to compare and contrast. So if you're looking for a starting point and you know PyTorch, this is my recommendation for how to start playing around with these representations.
B
The best example I've seen of this is from Tom Miller's group at Caltech, published this year, called OrbNet. They basically do a tight-binding calculation in order to get some interesting features.
B
This is really cool because it's taking advantage of other information from the models. The downside is that you have to do a tight-binding calculation. So, depending on your application, tight-binding is considered either expensive or cheap. If you talk to the classical molecular mechanics people, they would say tight-binding is ridiculous and way too slow.
B
If you talk to the coupled cluster people, they would say tight-binding is no problem; it's still far cheaper than my normal calculations, and I'm perfectly happy to do that either way. I like the way that they're thinking about this: the idea that there are additional ways of representing these molecules besides just the nodes and atoms; there's actually electronic information coming from the calculations themselves. Getting close to the end, the next thing I want to talk about is real-space convolutions.
B
The idea is basically borrowed from image classification. There are huge models from Google and others on how to apply these very, very dense, very, very deep neural networks to image problems, and the idea is basically: if we can apply this to images, why don't we try to apply it to molecules and materials as well? So that's great.
B
The real question, then, is: what is an image of a molecule? What is the image of a material? Three applications that I've seen are one by Isaac Tamblyn's group at the NRC in Canada, one from Yoshua Bengio's group, also in Canada, from last fall, and one from AJ Medford's group.
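The usual answer is a voxel grid: smear each atom onto a regular grid with a Gaussian and treat the resulting density like image channels. A one-dimensional sketch (3D works the same way, axis by axis; the grid bounds and smearing width are arbitrary choices):

```python
import math

# "Image" of a structure: Gaussian-smeared atomic density sampled on a grid.
def voxelize(positions, lo=0.0, hi=4.0, n=40, sigma=0.3):
    step = (hi - lo) / n
    grid = []
    for i in range(n):
        x = lo + (i + 0.5) * step              # center of voxel i
        grid.append(sum(math.exp(-(x - p) ** 2 / (2 * sigma ** 2))
                        for p in positions))
    return grid

grid = voxelize([1.0, 3.0])   # two "atoms" on a 40-voxel grid
```

One would typically use one channel per element type and feed the stack to a standard CNN; the catch is that nothing in this encoding knows about rotations.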
B
The idea is really interesting and the methods are fast to implement, but you often run into the same questions that Tess was talking about: how do you encode things like rotational invariance? A lot of the difficulty is in how you augment your data enough, or how you enforce some other representation in order to make that possible, so those are the major limitations right now.
B
If I wanted to predict this from scratch and give this as an input to a DFT code, I would also need to encode things like the lattice constants, the lattice angles, or the symmetry of the system, and so there's been some recent work trying to actually encode that as well, as part of the representation, by Yusung Zhang and Antonio Bonacici at MIT, and others.
B
So if you have to start from scratch and try all these representations every time, you're never going to make much progress unless you get lucky, and so, just like Google and others are pushing things like AutoML for image recognition,
B
two examples that I'm aware of are: one, Anubhav Jain at LBL had a very nice paper this year talking about this Automatminer tool, which will try a lot of different representations and a lot of different models and try to come up with the best one. I found that it works very well for small data sets or simpler things.
B
It is usually not super competitive yet with the large inorganic data sets, but I think it will improve as their models get more complicated. Schrodinger, which is a commercial company, has a tool called AutoQSAR, which they've published on; you can read a little bit more about it. It's the same idea, but applied to small-molecule featurization techniques.
B
I think tools like this are going to become more important, because right now, every time we try this with a new data set and type of representation, it's basically a PhD-level project, and that really slows things down if it takes months or years every time you try a new challenge. And finally, I just want to point out a couple of areas that I think are opportunities, some of which we're working on, some of which I really wish someone would come along and just solve to make my life easier.
B
So the first one I want to point out: some of the most powerful methods in small molecules right now are based on natural language processing. As I talked about, the thing that's limiting that application to materials is that there's no grammar for crystals. So consider the first person to come along with something text-based, whatever it is, that encodes all of the interesting information about a crystal in a way that you can then apply all of the standard natural language processing techniques to, to generate new structures or whatever.
B
That would open up a whole new suite of methods and kick off a ton of work. I don't know how to do it; I think it's a hard problem. The fact that there's not a grammar says that the inorganic materials people have already thought about this and been unsuccessful, but that would really be a step change and open up a new area of machine learning for materials.
B
The graph methods, I think, are moving especially fast compared to all these others, because the graph machine learning community is very large and very active right now, so we can benefit from what they're doing. This OrbNet idea is very cool; I haven't seen it applied to any materials, but I'm sure it's coming.
B
This has really limited how we generate new materials, and so I think there's a lot of opportunity there to apply the same ideas to these other, more interesting systems. Okay, finally, I just want to highlight people who helped contribute to this. Javi and Brandon (Brandon's a postdoc here at NERSC on the NESAP program) and June all helped make some of the slides for this, so thanks, Javi, Brandon, and June.
B
I also wanted to highlight four students who are applying for PhD positions this year, since this is PhD application time. Sudeesh, Sarab, Amish, and Richie are all awesome people who have been working on machine learning and materials, and so if you're interested in some of these ideas, or machine learning potentials, or whatever, feel free to shoot me an email and I'm happy to give you their info.
B
The community is pretty close-knit, and I'm really excited with how these things are going. There's been a lot of collaborative work, so hopefully this leads to more collaboration in the future. So, with whatever time I have left, I am happy to answer some more questions.
B
Okay, so the first question is: what should the representation be for a classical MD simulation that's very large, like 20,000 water molecules in a box? It's a great question. In the classical MD codes, a lot of the work goes into long-range forces like Lennard-Jones and electrostatics.
B
So if that is important, you need to be a little bit careful there. There are several different projects that I know of to try to implement neural network potentials into MD codes like LAMMPS, so for the short-range contribution, basically the bond energies and the angular interactions, I don't see any reason that those shouldn't apply. I find this very similar to reactive molecular dynamics potentials, where those have worked relatively well for long-timescale MD. I think people are trying to do it;
B
it's just very time intensive to take one of these machine learning codes and interface it with something like LAMMPS. So the idea is good. If you're interested, shoot me an email and I'm happy to point you in the direction of some people working on this. It's just a very time-intensive process, because you have to be very careful with the code.
B
The second question is on the cutoff radius, and they mentioned that the typical cutoff radii for classical MD are 12 angstroms, but we're using seven or eight, or maybe even lower, in these unit cells. So what is the impact of this, and what's the problem? I think it's a great question.
B
I think one thing you have to consider is that you have to include a lot of information in that 12 angstroms: you can get by with a simple Lennard-Jones potential as long as you know every other neighbor nearby. In this case, we have a little bit more information from these neural network potentials. It's not just a Lennard-Jones; it also has some wiggles and other things, and so there are correlations in what that function should look like and what happens nearby.
B
For a lot of the systems that people are training on, van der Waals has not been super important. For most of the catalysis examples, the DFT codes themselves don't even include van der Waals, so if I'm running RPBE calculations, I don't need a 12-angstrom cutoff radius, because the DFT doesn't even have that in there. As soon as you go to a system where van der Waals is important, things like, as you say, a large water box, or larger unit cells, or large molecules on a surface, or less covalent interactions, then you have to be more careful.
A
Alrighty, I don't see any more questions at the moment. I don't know if you need to run to another thing; just in case any more come in, but I do at least want to note that it's 11:00.
A
Thank you very much, Zach, for this fantastic presentation: a great overview of a lot of the activities going on in this space, a lot of good stuff and good references to follow up on. And with that, actually, that brings us to the end of the Deep Learning for Science 2020 program. So, just very briefly, I want to thank everyone who contributed to and attended the webinars; I think we had a lot of really fantastic speakers.
A
We do hope to be able to get back to an in-person event next year, but of course stay tuned for announcements on that one. More special thanks to Mustafa: he wasn't able to join today, but he was really the one driving the organization of this event; he did the majority of the work putting together the program, and did a great job.
A
So thanks to Mustafa, and the rest of us will still be on Slack, so feel free to continue using that workspace to discuss the material or related deep learning for science topics. We hope to see you next time.