From YouTube: Week 11 - Qualitative Choices in Representations for Molecules, Materials, and Surfaces - Ulissi
Description
Speaker: Zachary Ulissi, CMU
More about this lecture: https://dl4sci-school.lbl.gov/zachary-ulissi
The Deep Learning for Science School: https://dl4sci-school.lbl.gov/
We are saving the best for last. We're very pleased to have Zach Ulissi with us today. Zach is an assistant professor of chemical engineering at Carnegie Mellon. He works on the development and application of high-throughput computational methods in catalysis, machine learning models to predict their properties, and active learning methods to guide these systems. Applications include energy materials, CO2 utilization, fuel cell development, and additive manufacturing. He's been a part of our community, the national lab HPC communities, for a while: he did his PhD as a DOE CSGF fellow at MIT. Today Zach is going to talk to us about representations for molecules, materials, and surfaces, a topic that I think is of interest to a lot of us doing science here. So with that, Zach, I think you can take it away. Thank you.
I was excited when Mustafa asked if I wanted to give a talk on representations, because this is something that has been moving very quickly, and I found myself having trouble keeping track of what was going on in the field and what different people were doing. So I'm really glad this was the right impetus for me to sit down, organize things, and think about what everyone is working on.
I don't think I'm going to go into quite as much detail as she did on exactly how the math of those transformations works, but they're very much aligned, and the work she's talking about is really where I think a lot of the representations are heading; there's a lot of progress in the area. So I'll go over some of the things she talked about very briefly, maybe a little closer to the end. But if you find yourself really asking what the ideal representation should be, or where this field is going in five years, I would go back and check her talk as a refresher.
One of the reasons why I think this area is so exciting is that so much is changing so quickly among these related problems. In this talk I want to cover three main classes of materials and molecules. The first is small molecules; that's where a lot of things are being done for the first time. The second is inorganic materials, which are really important for a lot of energy applications and have their own set of challenges that I'll talk about. And the third, shown in the little picture in the upper right, is catalysis and surface science, which is very related to the first two but in a sense combines both sets of challenges; I'll talk about that in a little bit.
On this slide I did my best to try and reproduce what I've seen in the literature; I'm sorry if I missed something specific. On the x-axis I plotted roughly when there seemed to be a burst of papers, or when the seminal paper for the area and type of material seemed to have been published. On the y-axis I have a qualitative breakdown of the different types of representations that have been used.
Some of these bins are a little blurry. For example, as I'll discuss, atomic environments and graphs sort of overlap, and it's not always super clear which bin a method should fall under, but there are some differences. One thing that I find very interesting is that if you look at the x-axis, with the exception of small-molecule fragment models, almost all of these fall in roughly 2005 to 2020, and many have really appeared in just the past five years. I don't think I've ever put together a talk with as many links to arXiv as this one, just because so many things are moving so quickly and there are so many papers and so many things going on. So it's very exciting.
At the same time, it's a little bit overwhelming, because when we sit down with our system of interest, whatever we're trying to do, we have to make a choice about where we're going to start and what class of materials we're going to go for. So, for example, if I have a student join my group and we're talking about how to try and solve some problem, I might give some idea of which of these qualitative classes to use. But the high-level perspective of what is really the best, and how things are moving, has been very, very difficult to keep track of, and it can make a really large difference in the sort of behavior that you get out. I think, as Tess showed, in some cases very subtle changes in the models or the representations can have very large impacts on the sort of behavior you see.
Okay, so today what I want to do is first talk about why there are different challenges in small molecules, materials, and surfaces. Each of those subfields has its own issues and questions, and the way they think about the world impacts the way they represent things. If we don't understand where they're coming from or what challenges they're trying to solve, we're not going to be able to get the context for why they're using different types of models.
Then I'll go through each of the different qualitative representation classes that I just showed on the previous slide, and at the very end, very briefly, I'll talk about some recent work on how to automatically try a lot of these representations and find the one that works best for your system; like most AutoML things, it works sometimes but is not always the best. And then a little bit of future outlook: where I think there is progress, and where I think there needs to be some development in order to enable some new areas.
Okay, I want to start with small molecules, because that's really where a lot of these representations are coming from. The systems here are mostly simple hydrocarbons or oxygenates. These are small molecules: by that I mean usually fewer than 20 heavy atoms, where a heavy atom is anything that's not a hydrogen.
Usually it's carbon, nitrogen, or oxygen; sometimes you also have sulfur in there. The space is pretty overwhelming: there was a really well-known work from about 2010 that brute-force enumerated every small molecule under 17 heavy atoms, and with that brute-force enumeration you're already at 160 billion possible molecules. That's wild, and 17 isn't even as high as you could go.
It would be very easy to add more things onto the molecule in the upper right, so this is really a combinatorial space. If you add in any sort of metal center, or you start talking about, for example, sulfur (I don't think that was in the original enumeration), it gets very, very complicated very quickly.
At the same time, the number of elements considered with these small molecules is very small. Like I said, it's really focused on C, N, O, S, and H, and sometimes fluorine, because fluorinated compounds are pretty popular. So there are really only four or five or six elements, depending on whether or not you count the hydrogen in your representation, and that changes the way people think about these models.
B
It
is
okay
to
learn
why
nitrogen
is
different
from
carbon,
simply
through
brute
force
of
showing
in
enough
situations
where
nitrogen
is
different
from
carbon
in
the
inorganic
material
spaces.
I'll
talk
about,
there's
many
different
elements,
and
so
that
changes
the
way
that
you
have
to
think
about
the
problem
a
little
bit.
Another thing to keep in mind is that in the small-molecule space, the computational chemistry methods are really well developed. There's a small number of atoms and a small number of electrons, which means you can use very accurate computational chemistry methods and they don't take a huge amount of time. We can run DFT on a molecule like this and it might take on the order of minutes.
B
There
are
people
who
do
really
top
level
calculations
like
quantum
monte
carlo,
where
you,
you
basically
get
the
exact
answer,
and
it
is
possible
to
basically
go
to
that
level
and
say
what
is
what
is
the
exact
answer?
So
there
are
a
lot
of
really
large
data
sets
that
have
taken
advantage
of
this,
the
scalability
and
this
relatively
fast
compute,
and
that
means
that
the
data
sets
here
are
very
large.
They've been around for a while; people have been doing high-throughput calculations for small molecules for a long time. There are many different self-consistent databases available. Most of them are more than 100,000 molecules, and there are already some with more than 100 million, which is just wild if you think about 100 million DFT calculations.
At the same time, there are some things that make this problem really hard, specifically the fact that people really care about entropy and fluctuations, which are difficult to capture with simple DFT calculations. The other problem is that for a lot of biological systems, because things are so targeted in this area, very small energy differences are important, maybe on the order of 1 kcal/mol. This is much more stringent than in a lot of inorganic materials.
So there's a lot of data and the methods are good, but at the same time you also have to be really accurate for people to trust your results. The obvious application areas for these small molecules, the ones driving most of the materials discovery efforts, are things like biopharma, polymer design and synthesis, and organic photovoltaics.
So I mentioned that a group had already brute-forced the enumeration of small molecules, and this dataset is freely available. It's called the GDB-17 dataset. It doesn't have energies or forces from DFT; it is just an enumeration. This was in 2013.
For example, the QM7 dataset is the GDB-17 dataset after you select only molecules with up to seven heavy atoms. The QM9 dataset is much larger, but it's the same idea: you just select everything up to nine heavy atoms instead of seven. And there's a whole host of small-molecule datasets that are all based on this original GDB-17 enumeration. It's nice that someone already did the hard work of saying what is available.
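As a toy illustration of that kind of subsetting (the molecules and formula dicts below are made up for the example; this is not the actual GDB tooling), carving out a QM7- or QM9-style slice just means keeping entries whose heavy-atom count is at or below a threshold:

```python
# Toy sketch: filter an enumerated molecule list by heavy-atom count,
# the way QM7/QM9-style subsets are carved out of a larger enumeration.
# The molecules and formula dicts here are illustrative placeholders.

def heavy_atom_count(formula):
    """Count all atoms except hydrogen."""
    return sum(n for elem, n in formula.items() if elem != "H")

def select_subset(molecules, max_heavy_atoms):
    """Keep molecules with at most `max_heavy_atoms` non-hydrogen atoms."""
    return {name: f for name, f in molecules.items()
            if heavy_atom_count(f) <= max_heavy_atoms}

molecules = {
    "methane": {"C": 1, "H": 4},
    "ethanol": {"C": 2, "O": 1, "H": 6},
    "octanol": {"C": 8, "O": 1, "H": 18},  # 9 heavy atoms
    "decane":  {"C": 10, "H": 22},         # 10 heavy atoms
}

qm7_like = select_subset(molecules, 7)  # methane, ethanol
qm9_like = select_subset(molecules, 9)  # adds octanol, still excludes decane
print(sorted(qm7_like), sorted(qm9_like))
```

The real datasets are built from SMILES enumerations rather than formula dicts, but the filtering logic is the same.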
And if you look at this, you can see that there's actually a rotation that's possible: I'm going to swing either the right or the left functional group around, and depending on how I do that I'm going to get different energies; there are going to be different local minima, and there's also going to be a barrier for going between them. So the one on the left is different from the one on the right, because the CH3 groups are tilted by 60 degrees.
The other thing we see is that the difference between local minima is very small: noticeable, but small. If I have the CH3 groups across from each other versus right next to each other, there's a difference of about four kilojoules per mole. This is a relatively small energy difference, and if we want to be able to capture it, we need a method that is accurate at the four-kilojoule-per-mole level.
These conformations are really, really complicated to capture, and there are a lot of these degrees of freedom. This is a very simple case; in a lot of larger molecules there are many, many different bonds you can rotate around to get different, distinct local minima. That makes this problem really hard.
Okay, so that's small molecules. The next thing I want to talk about is the work that's been done on the materials science side, really driven by efforts like the Materials Genome Initiative. These materials come from a very large set of possible spaces, and there are databases of experimental structures already available.
The computational datasets have gotten much more popular recently; there are on the order of one to five million enumerated structures out there. But this is really just a very small subset of the possibilities.
No one can go and generate every possible crystal structure with every possible elemental composition, because it's just too large; it's a combinatorially large space. So there is no amount of work we can do to make something like the GDB-17 for all possible crystal structures. What we can do is look at one of these large databases and select from there, and that's usually a good starting point, but it makes things difficult.
The computational methods are fairly well established. Most of these crystal structures are pretty small, so that's good: DFT works fairly well for most of them. There are well-known issues when you go to things like large crystal structures, or you want to incorporate disorder or entropy, or with some very specific classes of materials like oxides, where DFT does not do a good job of describing the behavior. But for most of these crystal structures, the methods are fairly well established.
The datasets are getting pretty good; there are already several large computational databases on the order of a hundred thousand, maybe a small number of millions, in size. So that's excellent.
You can get really powerful models at that size. I would say the key challenges here are, first, that there are a lot of properties we want to consider: in addition to the normal things like stability, we also want to capture things like mechanical properties, which are somewhat more complicated calculations, or thermal or electronic properties.
Those can be a little bit tricky and require different levels of theory. Periodic boundary conditions are also really important, and I'm going to talk about why that's an issue. It's a little bit silly, but it is a way to distinguish what's being done in this community from what's being done with small molecules, and it often limits the representation that we use.
The accuracy requirements are a little less stringent than for small molecules. Typically we're happy with something on the order of 50 meV per atom, so that's a little more forgiving than the small-molecule case. The driving applications, the reason people are pushing these efforts, are really energy materials: photovoltaics, thermoelectrics, batteries. Most of the big screening efforts are in those spaces. So I want to start with periodic boundary conditions, for those who aren't super familiar.
If the cutoff radius is less than half of the unit cell width, the atom cannot see itself, but if the cutoff gets large enough, it is possible for it to see itself, and that becomes an issue; it's something we have to think about, and we have to make sure our representation captures it. Most of the cutoff radii I'll talk about today are on the order of four to ten angstroms, usually between four and six. So usually this is fairly local, and usually the unit cells are a little bit larger than that.
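As a minimal sketch of that self-image concern (a hypothetical helper, not from any particular simulation package), you can count how many periodic images of an atom fall within its own cutoff sphere in a cubic cell:

```python
import itertools

def self_image_count(cell_length, cutoff):
    """Count periodic images of an atom within `cutoff` of itself in a
    cubic cell of side `cell_length` (the atom itself is excluded)."""
    count = 0
    # Scan enough neighboring cells to cover the cutoff sphere.
    n = int(cutoff // cell_length) + 1
    for i, j, k in itertools.product(range(-n, n + 1), repeat=3):
        if (i, j, k) == (0, 0, 0):
            continue  # skip the atom itself
        dist = cell_length * (i * i + j * j + k * k) ** 0.5
        if dist <= cutoff:
            count += 1
    return count

# A 6 A cutoff in a 10 A cell: the atom cannot see its own image.
print(self_image_count(10.0, 6.0))  # 0
# The same cutoff in a 5 A cell: the six face-neighbor images are visible.
print(self_image_count(5.0, 6.0))   # 6
```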
The reason I bring this up is that it's really easy to describe these periodic boundary conditions. It's easy to think about, and when you run your DFT calculation it will take care of all this stuff for you. But it is also really, really easy to make mistakes, and it's easy for this operation to be slow. In my own research group, and among experimental or theoretical collaborators in related areas,
we spend an embarrassing amount of time worrying about our PBC implementations, even though the idea is relatively simple. So if you are taking a small-molecule code and applying it to materials, this is something you're going to have to worry about; it's probably the first thing you'll have to deal with. Another thing that comes up is a modeling question of what you do after you apply your periodic boundary conditions. On the left here, I've shown a really simple crystal structure.
Let's say all the atoms are exactly the same, but inside of the unit cell there is one central red atom and then another blue atom that is being repeated around, so there are basically two different atoms in this representation. This is like going to the Materials Project and asking for the cubic cell representation.
It would give you something that has two atoms, even though they're all the same atom. And if I draw a little cutoff radius of four angstroms, larger than the bond distance of about three and a half, what I see is that this red atom could be considered a neighbor of all four of these blue atoms.
Okay, if I just make arguments about how the red atom is covalently bound, I would say it probably has four bonds. There is then a modeling question, or a representation question, about what to do with this bonding information. I could reduce this down and say every atom is bonded with itself four times through different periodic images, since it is the same red atom being repeated over and over again. A common assumption in some of the graph representations that I'll talk about is that you just pick the minimum-image-convention nearest neighbor: for each atom type, I look at the images of the second atom type, choose the one that is closest, and that's the one that goes into the representation. For the same system, if I don't reduce for symmetry, I label the red and blue atoms as two types of atoms, even though they are essentially the same under symmetry, and in that case I have sort of the same question. I could say every red atom is bonded with four blue atoms and vice versa (if I look at a blue atom, there are four red atoms around it), or, if I just use the nearest-image convention, I could say every red atom is bound to one blue atom.
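The minimum-image distance itself can be sketched for an orthorhombic cell like this (a toy helper, assuming the cutoff is small enough that wrapping to the single nearest image is valid):

```python
def minimum_image_distance(pos_a, pos_b, cell):
    """Distance between two atoms under the minimum-image convention
    in an orthorhombic cell; `cell` holds the three box lengths."""
    d2 = 0.0
    for a, b, length in zip(pos_a, pos_b, cell):
        d = b - a
        d -= length * round(d / length)  # wrap to the nearest periodic image
        d2 += d * d
    return d2 ** 0.5

# Two atoms near opposite faces of a 4 A cubic cell: the nearest
# images are only ~0.2 A apart, not 3.8 A.
cell = (4.0, 4.0, 4.0)
print(minimum_image_distance((0.1, 0.0, 0.0), (3.9, 0.0, 0.0), cell))  # ~0.2
```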
That last choice is the one that wouldn't change after you repeat the cell under periodic boundary conditions. This issue comes up a lot when you read papers on the graph convolution methods I'll talk about; it's a common modeling question that partially explains why some models work better or worse than others, and it's worth keeping in mind. Again, it's very simple, but it is a real logistical challenge when dealing with these representations.
Okay, for inorganic materials there are a couple of large datasets. I chose just two, based on ones that I am particularly familiar with. The first is AFLOWlib, run out of Duke University, and the second is the Materials Project, run out of LBL and Berkeley by Kristin Persson and others. Both of them, I would say, apply sort of the same way of thinking about things, and the calculations are fairly similar. AFLOWlib tends to have more enumerations of the same types of structures, while the Materials Project is usually a little more driven by what might be experimentally relevant, but both are very similar ways of thinking about things.
I highlighted the Materials Project just because, when you read papers on representations in materials science, most of them use the Materials Project dataset as a benchmark. That's the one people have chosen to use, not for any particular scientific reason, just because it's easy to download from and people are familiar with it.
Okay, and finally I want to talk about why surface science and catalysis, which is really where I spend most of my time in my research group, is so complicated, and why it has been really challenging for me to think about these representations. The possible space of materials and configurations that I need to think about is really overwhelming. The problem is that I basically take all of the diversity of the small molecules I just talked about: there are 160 billion small organic molecules that I can put on a surface.
The accuracy of the computational methods is also a limitation. For example, when you start to consider an extended periodic system for a surface, the number of atoms usually goes up: it's common to do 20 to 100 atoms, which is a little larger than inorganic materials usually are. That makes DFT reasonable but a little slow, and there are not that many experimental benchmark methods or datasets.
Charlie Campbell at the University of Washington is really the leader in those efforts. Because there are so few numbers that we really, 100 percent, absolutely know, there are a lot of different competing methods: you'll read papers in this area where people use PBE or RPBE or van der Waals functionals or hybrid methods or RPA, and it is really hard to say exactly what the right answer is, besides that it probably gets more accurate as you go up the chain to hybrids and RPA.
Disorder and large nanoparticles are both common things we want to think about in catalysis, and we often consider oxides, so all the problems with oxides I mentioned for materials also show up here. The datasets are really small compared to materials and small molecules, but they're growing; all the ones I'm aware of are less than 100,000 structures, and most of those have been published in the past year or two.
So we have all of the diversity of the first two areas, but our datasets are orders of magnitude smaller, which is a problem. Common challenges: there are a lot of reactions we need to consider; the accuracy requirement is not super stringent, usually plus or minus 0.1 eV is okay; and, like I said, the driving applications are energy materials, but on top of that there are applications to manufacturing, fuel cells, and batteries.
So let's go into a little more detail. For one of these surfaces, I'm thinking about all of the possible intermediates I could have in a possible reaction pathway. This is a paper I worked on when I was a postdoc a few years ago. We were looking at a relatively simple system of CO and hydrogen in the gas phase reacting to selectively make one of a number of possible products, which could be ethanol, methane, acetaldehyde, methanol, water, or CO2.
Ideally, we want to make something valuable like acetaldehyde or methanol, and we don't want to burn it to CO2 and water. Even for the simple rhodium system, one metal, a flat surface, no complexity, there are thousands of possible pathways that I could write down, and finding just the reduced pathway on the right means you have to consider all the possible intermediates and all the possible reactions. That gets really, really time-consuming.
For any individual intermediate, I have to do a series of calculations where I watch these adsorbates move around and find the most stable configuration. This is an example of an OH on a nickel-gallium surface from some unpublished work. I guessed it should be on a nickel site, and it looks like it sort of moves over to a nickel-gallium bridge, so it's actually moving sites and changing configurations; it's very dynamic.
Okay, the same idea of small molecules on inorganic materials comes up over and over. It's not just thermal catalysis; it's also CO2 utilization, water splitting, hydrogen storage, selective catalysis, water desalination and remediation, polymer-metal interfaces, corrosion resistance. All of these are basically the same fundamental question of how small molecules interact with inorganic surfaces, and again it's the same hard problem that shows up everywhere.
Okay, so with those ideas in mind, we can start to think about how we might compare different types of representations for the system that we care about. I want to start with small molecules, because again, that's the most established area and there's been a lot of work there. On the left is a picture from some work on developing machine learning models for how similar two different molecules might be to each other, using the QM9 data. This was the first paper to really establish what QM9 was.
They were the ones who did those calculations, and that was only 2012, only eight years ago. On the right is sort of a high-level review and benchmark paper called MoleculeNet, from Vijay Pande's group at Stanford. What they did was compare a lot of different methods, and it's very similar to what you would see in any other area of machine learning: we're interested in how the accuracy changes as a function of the training set size. So this is a simple learning curve; this comes up over and over. That's great.
So I'm going to take exactly the same learning curve and re-plot it on a log-log scale: on the x-axis it'll be the log of the training set size, and on the y-axis the log of the MAE. What we see, and I'll show curves for a lot of different methods on QM9 in a second, is that a lot of these methods are very linear.
Surprisingly linear in this log space, in fact. The slope of a method says something about its effective dimensionality. We want this curve to be as low as possible, so shift the whole thing down, and we also want it to be as steep as possible, which means adding more data helps quickly. We can often get some insight into what's going on with these systems by looking for features: for example, a plateau in this space usually implies that there are multiple data points that all have the same representation. So that's something we can see.
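As a sketch of that kind of analysis (the numbers here are synthetic, not from any of the papers mentioned), the learning-curve slope is just a least-squares fit of log(MAE) against log(training set size):

```python
import math

def learning_curve_slope(train_sizes, maes):
    """Least-squares slope of log10(MAE) vs log10(N)."""
    xs = [math.log10(n) for n in train_sizes]
    ys = [math.log10(e) for e in maes]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Synthetic model whose error falls off as N^(-0.5):
sizes = [100, 500, 1000, 5000, 10000]
maes = [2.0 * n ** -0.5 for n in sizes]
print(learning_curve_slope(sizes, maes))  # ~ -0.5
```

A steeper (more negative) slope means the method improves faster as data is added, which is the comparison being made between representation classes here.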
This sort of analysis assumes you have a uniformly sampled dataset; if you have biased data, you're going to get different curves. So this isn't something you can attribute purely to the model; it depends on the data as well. But it does give us some idea of not just whether one model is more accurate than another at 100,000 data points, but how it is scaling, and whether the representation is actually more powerful or not.
People have proposed a lot of different small-molecule models, and Anatole von Lilienfeld, at the University of Basel and now at the University of Vienna, has really driven a lot of this. He has an awesome presentation on these sorts of ideas for small molecules, with the link at the bottom; this chart is from his slides, and he has a couple of related papers. What we see is that there are a ton of different methods that people have proposed for small molecules, all with different representations.
The curves can shift up and down, and there are also different slopes, so qualitatively, right away, we can see that there are two different classes of methods. One is the set in the upper right, things like Bag of Bonds, which I'll talk about.
Those have a lower slope than a lot of these other methods that are newer, a little more complicated, and include more properties in the feature set. So, qualitatively, the fact that those two slopes are different says that the lower red set is probably more powerful, because it's scaling better. We can shift a whole line up or down by playing around with how we train or how we do the hyperparameter optimization, but it's pretty rare that you actually change the slope by playing around with the parameters.
This is very powerful, because it says that you can do small-dataset training. I can do a hundred, and 500, and a thousand, all of which train fast and all of which I can train on a Google Colab instance or my own laptop or a desktop, and the scaling of that says something about how accurate you might be at a hundred thousand.
That is really powerful, right? It says you don't need to be doing these really crazy trainings in order to develop better methods. That's really cool, and not something that's obvious in all machine learning areas; it seems to be something about these large, diverse, well-sampled, unbiased datasets. Okay, we can do the same thing not just for small molecules; we can also do it for inorganic crystals.
This is a chart that I made for some unpublished collaborative work. On the left are small molecules, basically the same as before: QM9 with two different types of models, the same models I was showing here. On the right is the Materials Project formation energy dataset with two common methods that I'll talk about, including CGCNN. What's interesting is that the same method applied to two different datasets, or two different regimes, yields two different slopes, and so this also says something about the effective dimensionality of the problem on the left.
Okay, with that in mind, I'll start jumping into representations. Actually, I think before I do that, there's a question.
Yeah, just to interrupt: there was one question on those plots that you just showed. Do you want me to read it, or do you want to read it yourself?
I can see it. So the question is: what does the y-axis mean here, the total energy or thermochemistry data? Let me go back; I think you're talking about this one, is that right? Yeah, that was it. Cool. Okay, so the y-axis here is, I believe, the formation energy or the atomization energy of these small molecules; that is basically the energy.
The atomization energy is: I take a small molecule and calculate the energy, and then I keep dilating it, making it bigger and bigger until all the atoms are really spread out and don't see each other. That is sort of the same thing as a cohesive energy, but for small molecules.
Those are the two metrics people usually use with these QM9 datasets. You could just as easily do the same thing for any other property, like polarizability, or how much it likes some solvent, or some electronic property like the band gap or the HOMO-LUMO gap, or whatever else you want. Those are all common, but the formation energy is the one people tend to use when developing these new methods.
Alright, so I'll go ahead and continue, but if I see anything pop up, I'm happy to take some time to chat and discuss a little more. Okay.
So let's start thinking about different ways that we can represent these materials. The first one I want to start with is the simplest, and that is composition features, where we're just looking at the types and numbers of the different elements.
This is especially common in materials science. Experimentally, if you're making something, you might only know the composition, how much of each material went into it; you might not know exactly what the crystal structure is, or the other things we care about for some of the more complicated models. The general idea is that you take the composition on the left, maybe some binary oxide or some other ternary inorganic material, and you enumerate a lot of features for each of the element types.
After the feature combination stage, we have a fixed-length vector that we can play around with. The pros: it's very simple, it's physically motivated, and I only need to know the composition. And because it's so simple, it often works with very small datasets, which is cool. The cons are that this composition obviously doesn't allow me to specify why one polymorph of a binary oxide might be different from another polymorph with the same stoichiometry, so it can't handle polymorphs.
It won't tell me why certain structural features are important, and if you apply it to a very large dataset like the Materials Project formation energies, it tends to perform quite a bit worse than the more complicated representations we're going to talk about later. There are a couple of libraries that people use in this space over and over. The most common one is the set of descriptors called the Magpie descriptors, from Chris Wolverton's group at Northwestern University, published in 2014; Bryce Meredig was one of the students on that paper.
B
Bryce
is
also
the
cto
of
situation
informatics,
which
does
a
lot
of
work
with
similar
problems
in
material
science.
Now,
basically,
they
went
through
and
collected
a
bunch
of
different
properties
and
some
standard
combinations,
so
some
combinations
of
the
types
of
elements
and
the
fractions
elemental
properties,
like
the
mean
absolute
deviation,
minimum
maximum
mode
whatever
for
things
like
the
atomic
number
or
the
radii
or
number
of
electrons
or
a
bunch
of
other
common
ones.
They
have
electronic
structure
attributes
and
they
have
ionic
control
ionic
compound
attributes.
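A minimal sketch of this kind of composition featurization (the property table and the statistics chosen here are illustrative stand-ins, not the actual Magpie set):

```python
# Fraction-weighted statistics of an elemental property give a fixed-length
# feature vector from composition alone. The property table below is an
# illustrative stand-in for the real elemental-property lookup tables.
ATOMIC_NUMBER = {"Ti": 22, "O": 8}

def composition_features(fractions, prop=ATOMIC_NUMBER):
    """fractions: dict mapping element -> molar fraction (sums to 1)."""
    vals = [(prop[el], x) for el, x in fractions.items()]
    mean = sum(v * x for v, x in vals)
    mad = sum(abs(v - mean) * x for v, x in vals)  # weighted mean abs. deviation
    vs = [v for v, _ in vals]
    return {"mean": mean, "mad": mad, "min": min(vs), "max": max(vs),
            "range": max(vs) - min(vs)}

feats = composition_features({"Ti": 1 / 3, "O": 2 / 3})  # TiO2
```

The same statistics would be repeated for each tabulated elemental property and concatenated, which is why the vector stays fixed-length no matter how many elements appear.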
B
There are a lot of implementations now, because these are very common. There's a set of code for Magpie itself; Johannes Hachmann has done the same thing at the University at Buffalo; and Anubhav Jain at LBL has a bunch of these implemented in Automatminer, which I'll talk about a little bit later. But I would say a lot of things really rest on the same set of descriptors at this point.
B
You can take these descriptors and you can find linear or nonlinear combinations that might make even better descriptors, and this is very common in the materials space; Matthias Scheffler and others, like Luca Ghiringhelli, have really been driving this.
B
Basically, you search for some nonlinear combination of these that describes the property you're interested in, and so in this example, they basically found that with these two really complicated descriptors they were able to separate really well what was a metal or a non-metal, and this is cool because you can see the algebraic formulation.
B
And so when you deal with these composition features, one of the really common things that you do to improve the representation is run it through a code like SISSO, which tries to find the best nonlinear combination to help you with your problem. This is really common, so that's why I bring it up.
B
The classic paper is by Benson when he was at USC, and it's typically referred to as Benson group additivity. The idea is basically that I am going to represent the energy of this small molecule, maybe the formation energy or something else, by adding up all the contributions from different subsets.
B
This is motivated by looking at something like an alkane chain. On the bottom left, these are formation energies, heats of formation, for alkanes of different lengths, and what we see is that as we add more and more CH2 groups to the middle, the extra heat of formation always changes by about 4.9 kcal/mol.
B
The way that I apply this to a larger molecule is that I go through and find all of the unique types. So there is one methyl group with a carbon, so that's labeled one, or rather there are two of those, one on each side; there is one carbon that has two methyl groups and another carbon nearby; and I'm going to represent the total energy as a linear combination of each of those independent fragments.
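A group-additivity estimate is just a dot product between fragment counts and tabulated group contributions. A minimal sketch (the group values here are rough, illustrative numbers in kcal/mol, not authoritative Benson values):

```python
# Total estimate = sum over fragment types of (count * group contribution).
# The values below are illustrative approximations, not the real Benson table.
GROUP_VALUES = {"C-(C)(H)3": -10.2, "C-(C)2(H)2": -4.9}  # kcal/mol

def group_additivity(group_counts, values=GROUP_VALUES):
    return sum(values[g] * n for g, n in group_counts.items())

# n-butane: two terminal CH3 groups plus two interior CH2 groups
hf_butane = group_additivity({"C-(C)(H)3": 2, "C-(C)2(H)2": 2})
# n-pentane: one more CH2, so the estimate shifts by one CH2 increment
hf_pentane = group_additivity({"C-(C)(H)3": 2, "C-(C)2(H)2": 3})
```

Each extra CH2 group changes the estimate by exactly the same increment, which is the linear trend in the alkane plot.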
B
If I increase the size of the radius, I will get larger and larger fragments. Each of these is now a fundamental thing that I'm going to build my model off of. I could build a linear model, I could also build a nonlinear model, and one common way to take all these different fragments and turn them into a representation is simply to count how often each one appears.
B
The pros of this are that it's simple and physical. The downside of this approach is that it scales very poorly with the number of elements or number of fragments: every fragment is considered different here. There's no intrinsic idea of why some of these alkane-like fragments are similar to one another; they're all considered completely independent, and so there's no way of combining those things together. If you see a new fragment that you've never seen before in a hypothetical molecule, you're sort of stuck, and that makes it challenging.
B
Okay, these approaches work really well on small molecules, and if you want to try out these things, my suggestion is to use a tool like RDKit, which is open source and really helpful, and will do all sorts of different simple fragment-based fingerprinting methods. I wouldn't write this sort of thing from scratch; there are already really, really good methods for doing this.
B
Okay, let's take a second before moving on. I see another question on slide 20: there was MAE versus training data set size, and you talked about good ML and bad ML.
B
Okay, so let's take a step back and then I'll come back to this one. So that's slide 29. Okay, so let's go back to the good ML and bad ML. The problem in this case,
B
the reason why it's saturating, is that if you have two molecules that have the same representation, and one has, for example, a formation energy of 5 kilojoules per mole and the other has a formation energy of 200 kilojoules per mole, but they have exactly the same representation, then no matter how good your machine learning model is (neural network, Gaussian process, kernel method, whatever you want), it cannot distinguish between those two. It has exactly the same representation, but it's labeled in two different ways.
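You can see this floor directly: for a repeated input with conflicting labels, the squared-error-optimal deterministic model can only predict the mean of the labels. A tiny illustration:

```python
# Two samples with identical representations but different labels: under
# squared error, the best any deterministic model can output for that
# representation is the mean of the conflicting labels.
def best_prediction(labels):
    return sum(labels) / len(labels)

labels = [5.0, 200.0]                     # kJ/mol, same representation
pred = best_prediction(labels)
errors = [abs(y - pred) for y in labels]  # irreducible error on both
```

No amount of extra data with the same representation shrinks those residuals, which is exactly the saturation in the learning curve.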
B
So the best that you can do is guess halfway in between, and you're bad at both. That is usually one of the drivers for this sort of saturation behavior, and it's basically a sign that your representation is not rich enough to distinguish between different things that have different properties.
B
The same thing comes up, and I think this is a good question, with composition descriptors. For example, I could have a lot of compositions that have very, very different energies, and from a representation like this I would have no way of distinguishing them. The model would have to give all of them the same energy, and if I'm trying to label a bunch of polymorphs and they all have different energies but the same composition, I'm stuck right away. My model is not going to get any better after
B
I add some more data, because I cannot distinguish between the things that I've already seen. This comes up a lot. It's a good way of diagnosing either issues with the data set or issues with the representation. The same problem can come up if you have bad data, where you have two of the same molecule and they're labeled differently; you get a very similar sort of behavior.
B
String representations like SMILES are very powerful because there are already a lot of really good machine learning models that operate on text strings, and so people have spent a lot of time applying natural language processing tools to these sorts of SMILES representations.
B
A second problem is that if I just generate some random string of numbers and letters, so "OOCC" or whatever, it's possible to generate a string that does not decode to a real molecule. That means that if your natural language processing tool is just spitting out strings, some of those might not even be molecules. It's just making up nonsense.
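A toy illustration of why random strings fail (this is a made-up syntactic check, not a real SMILES parser): even trivially necessary conditions, like balanced parentheses and paired ring-closure digits, reject many random strings, and a real decoder enforces far more chemistry on top of that.

```python
# Hypothetical, minimal "could this even be a SMILES string?" check:
# balanced parentheses and every ring-closure digit appearing an even
# number of times. Real decoders (e.g. in RDKit) check much more.
def passes_toy_check(s):
    depth = 0
    digit_counts = {}
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:       # closing a parenthesis that was never opened
                return False
        elif ch.isdigit():
            digit_counts[ch] = digit_counts.get(ch, 0) + 1
    return depth == 0 and all(n % 2 == 0 for n in digit_counts.values())

ok = passes_toy_check("CC(=O)OC1=CC=CC=C1")  # well-formed string
bad = passes_toy_check("C1CC(C")             # dangling ring digit and paren
```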
B
A lot of the progress in small molecules has been driven by better representations, by basically improving the actual grammar of the string itself. So this SELFIES representation, by Alán Aspuru-Guzik's group, is another string-based representation whose grammar enforces some nice properties, such that it always decodes to a small molecule.
B
So no matter what string you generate, you can decode that thing, and it will be something sensible. That's very cool. That means that I can take all the super cool stuff going on in machine learning, like BERT or other NLP models, and I can apply those directly to this area. I can also apply generative text models to small molecules. That's cool!
B
One of the cool things that you can do because of these sorts of representations is that you can come up with variational autoencoders or other generative models, or GANs, and you can generate new small molecules that are basically hypothetical things that you should try and test. This has gotten very hot in the past couple of years. It's really interesting.
B
The next type of representation I want to talk about is a little bit more complicated. Everything up until now, I haven't really talked about bonds or angles or other complicated features, and there is a whole host of methods that try to look at an atom and its nearby neighbors and come up with a representation that describes what's going on locally.
B
One of the most common is something called a high-dimensional neural network potential, or atom-centered symmetry functions, or a Behler-Parrinello machine learning potential; all of those are the same thing. The idea is basically that you take each atom, you look at its neighbors, and you try to come up with a fixed-length representation.
B
You take that representation, and maybe you use it to compare to another structure, or you feed it to some machine learning model and you try to predict the per-atom energies and sum those up. There are some small differences in how that gets done, but the same idea comes up over and over. These Behler-Parrinello or HDNNP potentials have been around since 2007.
B
So, for example, one of the entries in this vector might be for all of the copper-copper bonds. I am going to take all of the radii for those bonds and use each one in this little lookup table with some eta specified. So let's say eta is four and the radius is three: I'm going to look up that value, and it's about a G2 of 0.15 or so, and I will do that for every such copper-copper bond,
B
add them all together within my local cutoff radius, take the sum, and put it in the vector. I can do the same thing for angles. This has to be done for every unique combination, so I'm going to take all of the copper-carbon-copper angles and apply the same lookup-table idea to the theta that I get out of that angle computation, and I'm going to add up all of those that happen near me and shovel those into another section of the vector.
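The radial part of this can be sketched in a few lines. This follows the usual G2 form, G2 = Σ_j exp(-η (r_ij - R_s)²) · f_c(r_ij), with the standard cosine cutoff function; the parameter values are arbitrary choices for illustration:

```python
import math

def cosine_cutoff(r, r_c):
    # smoothly decays to zero at the cutoff radius r_c
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0) if r < r_c else 0.0

def g2(distances, eta, r_s, r_c):
    # one fixed-length feature per (element pair, eta, r_s) choice:
    # sum the Gaussian lookup over all neighbor distances of that pair
    return sum(math.exp(-eta * (r - r_s) ** 2) * cosine_cutoff(r, r_c)
               for r in distances)

# e.g. all Cu-Cu distances around one atom, with assumed parameters
feature = g2([2.5, 2.5, 3.6], eta=4.0, r_s=3.0, r_c=6.0)
```

However many neighbors there are, the sum collapses them into one number, which is what makes the representation fixed-length.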
B
A second thing is that this implicitly assumes that the representation should be local. So if there are long-range interactions, it can be harder to capture those in this sort of model, and that's another active research area that I'll talk about.
B
There are many, many, many such local-environment fingerprints, so there are reviews coming out all the time. This is one from Goedecker just this year: they compared many-body symmetry functions, the FCHL representation from Anatole von Lilienfeld's group, SOAP descriptors, the overlap matrix, and ACSF, which is the one I showed in the previous slide.
B
This is another choice you have to make: for every single one of these there's another choice of what all of these magic numbers should be and how many different types of things you should include, so the problem gets a little bit overwhelming. People have already started to compare accuracy for these different methods.
B
This is a really nice paper from Shyue Ping Ong's group at UCSD, where they basically went through, for a very common materials science problem, and compared different types of descriptors and different types of potentials, both in terms of the computational cost and the error. One interesting thing that came out of this was that these moment tensor potentials, which are relatively new, seemed to do quite well: they were either more accurate or lower computational cost, depending on what you care about, on the upper right graph.
B
The other cool thing is that you can see how all these different methods have different accuracies for these sorts of situations. This neural network potential is the same as the ACSF I was showing before, so
B
those orange points are also pretty good. Before you get started in one of these areas, I would think about what is easy to implement, and I would also think about which benchmark data set is closest to what you're doing, so that you don't have to try each one of these for your system and see which one is most accurate.
B
As I said, one of the downsides of this sort of local approach is that you don't get long-range forces. The most common long-range force is electrostatic interactions: if I have charge on my molecule, a charged atom interacting with another charged atom feels a force that scales as one over r squared, which is really scary. That's a very, very long-range force.
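To see why a plain cutoff is dangerous here, compare how much interaction energy lives beyond the cutoff for a 1/r pair energy versus a short-ranged one. This toy uses a 1D chain of neighbors with made-up prefactors, purely to show the scaling:

```python
import math

# Fraction of the total pair energy that lies beyond the cutoff, for a toy
# 1D chain of neighbors at spacing 1, truncating the lattice sum at `far`.
def tail_fraction(pair_energy, cutoff=8, far=4000):
    total = sum(pair_energy(float(n)) for n in range(1, far))
    tail = sum(pair_energy(float(n)) for n in range(cutoff, far))
    return tail / total

coulomb_tail = tail_fraction(lambda r: 1.0 / r)              # most of it!
screened_tail = tail_fraction(lambda r: math.exp(-2.0 * r))  # negligible
```

A local model truncated at a few angstroms simply never sees the majority of a Coulomb-like interaction, while it loses essentially nothing of a short-ranged one.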
B
So this is a relatively hot area now: trying to add in electrostatics and long-range forces. This is a paper that came out just a month or two ago from Behler. Basically, they implement two different neural networks with very similar symmetry functions, very similar representations, and they have a first step where they try to predict the electronegativity.
B
Similar ideas have been done for small molecules. For example, there was work by Michele Ceriotti last year, basically looking at other ways of including long-range effects into these sorts of interactions.
B
The point I just want to get across is that if you know you have long-range forces, you need to be aware of it, and it's going to change the way that you represent your system. If you have a charged system and you just take a local method and assume it's going to work because of machine learning, you're probably going to have a bad time.
B
The last thing I want to talk about with these local features is that some people have thought about how to improve the element scaling. So there is a representation called the weighted atom-centered symmetry function, where you basically add an additional weight to these symmetry functions that depends on the atomic number; that allows you to have a fixed-length representation that scales.
B
The performance of that for inorganic materials hasn't been awesome, but it does partially solve the element-scaling problem. Another example is by John Kitchin in my department here at CMU; he basically showed that the same set of weights in the same network could actually be used for all of the different elements.
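This weighting is a small change to a standard G2-style radial sum: neighbors of all elements go into one sum, each scaled by (some function of) its atomic number, so the feature count no longer grows with the number of element pairs. A sketch with arbitrary parameters:

```python
import math

def cosine_cutoff(r, r_c):
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0) if r < r_c else 0.0

# wACSF-style radial feature: neighbors of *any* element contribute to the
# same sum, weighted here simply by atomic number (one possible choice).
def weighted_g2(neighbors, eta, r_s, r_c):
    # neighbors: list of (atomic_number, distance) pairs, elements mixed
    return sum(z * math.exp(-eta * (r - r_s) ** 2) * cosine_cutoff(r, r_c)
               for z, r in neighbors)

# a Cu neighbor (Z=29) and an O neighbor (Z=8) share one feature channel
feature = weighted_g2([(29, 2.5), (8, 1.9)], eta=4.0, r_s=2.0, r_c=6.0)
```

With a plain per-pair ACSF, every element pair needs its own set of G2 channels; here the same channels cover any chemistry, at the price of some lost resolution between elements.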
B
Okay, one thing to keep in mind is that it has gotten really easy to make a neural network potential, and these are just the codes that I know of off the top of my head; I included links for all of them if you're interested, in alphabetical order. Don't rewrite your own neural network potential from scratch unless you're really sure that something is new. This area is moving really quickly.
B
Okay, moving on to something a little bit more complicated and a little bit closer to what Tess was talking about last month. Graph methods got really popular in about 2015, with this paper by Ryan Adams and Alán Aspuru-Guzik, when he was at Harvard. The idea is basically to apply graph convolutional networks to small molecules.
B
So one-zero-zero, or if it's an oxygen, zero-one-zero. There's a whole host of methods that are all related, slightly different methods but the same fundamental idea and the same sorts of inputs. The MoleculeNet paper that I talked about earlier, from Vijay Pande's group, does a really nice job of talking about how they're different and how they're similar.
B
If you're interested, I would definitely read those first. Most of them are based on the idea that you're looking locally and applying convolution operations, or you're passing messages around and then trying to collect the messages and predict final properties.
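One round of the message-passing idea fits in a few lines (sum aggregation with no learned weights; real models interleave this with trainable layers):

```python
# One message-passing step on a toy molecular graph: each node's updated
# feature is its own feature plus the sum of its neighbors' features.
def message_pass(node_feats, edges):
    new = {i: list(f) for i, f in node_feats.items()}
    for i, j in edges:                      # undirected bonds
        for k in range(len(new[i])):
            new[i][k] += node_feats[j][k]
            new[j][k] += node_feats[i][k]
    return new

# water-like graph: O (node 0) bonded to two H atoms (nodes 1 and 2),
# with a one-dimensional feature (here just the atomic number)
out = message_pass({0: [8.0], 1: [1.0], 2: [1.0]}, [(0, 1), (0, 2)])
```

After one step, each node's feature summarizes its first shell; stacking more steps (the "convolutions") grows the receptive field, and a final readout pools the node features into a molecule-level prediction.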
B
The weights could be distances or something else; we could include angles as additional complications; and there are other things that we can put into edge features, like what the distance is or what the properties of the nodes are. Ultimately, this graph representation comes down to a modeling decision. There's not one right way: if you look in the literature, there are all sorts of different ways that people are applying this.
B
So when you read one of these papers, it is not enough to just say whether it's message passing or not for the actual implementation. You also need to think about what the actual graph is that you're operating on, and how you apply that to your system. This gets back to some of the questions about periodic boundary conditions and representations that I was talking about earlier.
B
One thing that you could do is look at your neighborhood and pull everything together: you say, I want to find everything nearby, and everything that is within a certain distance or within a certain shell gets counted as a neighbor. That's the simplest approach, and it's usually the fastest.
B
It works pretty well, but one problem with this approach is that if you have two different element types that have different atomic radii, it's not always super clear what you mean by a covalent bond or what the radius should be.
B
So we can go from an element-specific representation to something a little bit more general using an idea called a Voronoi tessellation, where we basically take the atomic structure and create bounding boxes, where the rule is: I am going to make a line based on the fact that it is equidistant between two points. So the top-left blue line is equidistant between the orange and the black.
B
The left one is equidistant between the black and the red, and the lower left is between the black and the green. The Voronoi representation, then, is that the area, or the length, of the polyhedron face that forms one of those little edges says something about how much interaction there is between those two atoms. So in this case, the fact that there is a very large interaction between the black and the purple is represented by the fact that the line between the black and the purple is very long compared to the line between the black and the red.
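A crude way to see this numerically without a computational-geometry library (in practice you would use an exact tessellation, e.g. scipy's): sample the plane on a grid, assign each sample to its nearest atom, and count adjacent sample pairs with different owners. That count approximates the length of the shared Voronoi facet between each pair of atoms:

```python
# Grid-sampled 2D approximation of Voronoi facet sizes between point "atoms".
def shared_boundaries(points, lo=-1.0, hi=3.0, n=80):
    step = (hi - lo) / n
    def owner(x, y):   # index of the nearest atom
        return min(range(len(points)),
                   key=lambda k: (points[k][0] - x) ** 2 + (points[k][1] - y) ** 2)
    grid = [[owner(lo + i * step, lo + j * step) for j in range(n)]
            for i in range(n)]
    counts = {}        # (atom a, atom b) -> approx. shared boundary size
    for i in range(n - 1):
        for j in range(n - 1):
            for a, b in ((grid[i][j], grid[i + 1][j]),
                         (grid[i][j], grid[i][j + 1])):
                if a != b:
                    key = tuple(sorted((a, b)))
                    counts[key] = counts.get(key, 0) + 1
    return counts

# three collinear atoms: 0 touches 1, and 1 touches 2, but 0 never touches 2
counts = shared_boundaries([(0.0, 0.0), (1.0, 0.0), (2.5, 0.0)])
```

The facet sizes then become the edge weights of the graph, with no element-specific radius to choose.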
B
The downside is that these Voronoi methods tend to be a little bit more expensive; depending on your system and how many of these you have to do, that can get a little bit slow and slow down your methods.
B
Another subtlety is periodic boundary conditions. In this example, the zero atom is bonded with the one atom and is also seeing the one atom one cell over, and so in the adjacency matrix it would not just be a one, which would say that the two atoms are neighbors; it would be a two, in that there are identically two different interactions between those unique atoms in the system. And so this representation, I think, is most rigorously called a mixed multigraph.
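A one-dimensional sketch of that counting (with hypothetical numbers): with a cell of length 3 and a cutoff of 2.5, an atom 1.0 away is seen both directly and through a periodic image, so its adjacency entry is 2:

```python
# Count how many periodic images of atom j fall within atom i's cutoff in a
# 1D cell; this count is the multigraph adjacency entry, not just 0 or 1.
def image_count(x_i, x_j, cell, cutoff, n_images=3):
    return sum(1 for s in range(-n_images, n_images + 1)
               if abs((x_j + s * cell) - x_i) <= cutoff)

# atom j sits at 1.0 (direct) and at -2.0 (image, one cell to the left):
# both are within 2.5 of atom i at the origin
entry = image_count(0.0, 1.0, cell=3.0, cutoff=2.5)
```

In 3D the same loop runs over image shifts of all three lattice vectors, and each distinct image within the cutoff contributes its own edge (with its own distance) to the graph.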
B
This is the one that we're using for most of our representations now. Again, it is a modeling decision for these graph methods. Because these graph methods have done so well, there's been a lot of progress in this area, and there's still a lot of competition to make these things more accurate, so one of the ones I wanted to highlight is one that I found especially impressive.
B
Perfect, okay, thanks. Okay, I haven't talked a lot about materials so far; it's mostly been small molecules. If we want to apply this to materials, then the additional thing that we want to consider is: how do we encode the element type? A breakthrough in this area, I would say, was this paper from Jeff Grossman's group called CGCNN.
B
It's a very simple convolutional method. I would say it's so simple that maybe I shouldn't even call it a convolutional method, a graph convolution method, but the really key insight was to put in node features that were based on the elemental properties, so that you didn't have to learn every unique combination of elements.
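The contrast with one-hot encodings is easy to show. Here node features are looked up from a small table of real elemental properties (atomic number, Pauling electronegativity, covalent radius; values rounded and the table obviously incomplete), so an element absent from training still lands somewhere sensible in feature space:

```python
# Property-based node features: unlike a one-hot code, an "unseen" element
# still gets a meaningful vector, close to chemically similar elements.
ELEMENT_PROPS = {            # (Z, electronegativity, covalent radius / Å)
    "Cu": (29, 1.90, 1.32),
    "O":  (8,  3.44, 0.66),
    "Ag": (47, 1.93, 1.45),  # pretend Ag never appeared in training
}

def node_features(elements):
    return [ELEMENT_PROPS[el] for el in elements]

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

cu, o, ag = node_features(["Cu", "O", "Ag"])
# Ag sits far closer to Cu than to O in property space, so whatever the
# model learned about Cu-like nodes transfers
```

In practice CGCNN bins a longer list of properties per element, but the effect is the same: chemistry, not bare identity, defines the node.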
B
This really kick-started a huge effort; this paper from 2018 has already gotten a huge number of citations.
B
A lot of the new ideas, like DimeNet, SchNet, and others, are getting implemented there, and that has, I think, made things a lot easier to compare and contrast. So if you're looking for a starting point and you know PyTorch, this is my recommendation for how to start playing around with these representations.
B
The best example I've seen of this is from Tom Miller's group at Caltech, published this year, called OrbNet. They basically do a tight-binding calculation in order to get some interesting features.
B
This is really cool because it's taking advantage of other information from the models. The downside is that you have to do a tight-binding calculation. So, depending on your application, tight-binding is considered either expensive or cheap. If you talk to the classical molecular mechanics people, they would say tight-binding is ridiculous and way too slow.
B
If you talk to the coupled cluster people, they would say tight-binding is no problem; it's still far cheaper than my normal calculations, and I'm perfectly happy to do that either way. I like the way that they're thinking about this: the idea that there are additional ways of representing these molecules besides just the nodes and atoms; there's actually electronic information coming from the calculations themselves. Getting close to the end, the next thing I want to talk about is real-space convolutions.
B
The idea is basically borrowed from image classification. There are huge models from Google and others on how to apply these very, very dense, very, very deep neural networks to image problems, and the idea is basically: if we can apply this to images, why don't we try to apply it to molecules and materials as well? So that's great.
B
The real question, then, is: what is an image of a molecule? What is the image of a material? Three applications that I've seen are one by Isaac Tamblyn's group at the NRC in Canada, one from Yoshua Bengio's group, also in Canada, from last fall, and one from AJ Medford's group.
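The usual answer is a voxel grid: smear each atom onto a regular grid with a Gaussian and treat the resulting density like image channels. A one-dimensional sketch (3D works the same way, axis by axis; the grid bounds and smearing width are arbitrary choices):

```python
import math

# "Image" of a structure: Gaussian-smeared atomic density sampled on a grid.
def voxelize(positions, lo=0.0, hi=4.0, n=40, sigma=0.3):
    step = (hi - lo) / n
    grid = []
    for i in range(n):
        x = lo + (i + 0.5) * step              # center of voxel i
        grid.append(sum(math.exp(-(x - p) ** 2 / (2 * sigma ** 2))
                        for p in positions))
    return grid

grid = voxelize([1.0, 3.0])   # two "atoms" on a 40-voxel grid
```

One would typically use one channel per element type and feed the stack to a standard CNN; the catch is that nothing in this encoding knows about rotations.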
B
The idea is really interesting and the methods are fast to implement, but you often run into the same questions that Tess was talking about: how do you encode things like rotational invariance? A lot of the difficulty is in how you augment your data enough, or how you enforce some other representation in order to make that possible, so those are the major limitations right now.
B
If I wanted to predict this from scratch and give this as an input to a DFT code, I would also need to encode things like the lattice constants, the lattice angles, or the symmetry of the system, and so there's been some recent work trying to actually encode that as well, as part of the representation, by Yusung Zhang and Antonio Bonacici at MIT, and others.
B
So if you have to start from scratch and try all these representations every time, you're never going to make much progress unless you get lucky, and so, just like Google and others are pushing things like AutoML for image recognition,
B
two examples that I'm aware of are: one, Anubhav Jain at LBL had a very nice paper this year talking about this Automatminer tool, which will try a lot of different representations and a lot of different models and try to come up with the best one. I found that it works very well for small data sets or simpler things.
B
It is usually not super competitive yet with the large inorganic data sets, but I think it will improve as their models get more complicated. Schrodinger, which is a commercial company, has a tool called AutoQSAR, which they've published on; you can read a little bit more about it. It's the same idea, but applied to small-molecule featurization techniques.
B
I think tools like this are going to become more important, because right now, every time we try this with a new data set and type of representation, it's basically a PhD-level project, and that really slows things down if it takes months or years every time you try a new challenge. And finally, I just want to point out a couple of areas that I think are opportunities, some of which we're working on, some of which I really wish someone would come along and just solve to make my life easier.
B
So the first one I want to point out: some of the most powerful methods in small molecules right now are based on natural language processing. As I talked about, the thing that's limiting that application to materials is that there's no grammar for crystals. So consider the first person to come along with something text-based, whatever it is, that encodes all of the interesting information about a crystal in a way that you can then apply all of the standard natural language processing techniques to, to generate new structures or whatever.
B
That would open up a whole new suite of methods and kick off a ton of work. I don't know how to do it; I think it's a hard problem. The fact that there's not a grammar says that the inorganic materials people have already thought about this and been unsuccessful, but that would really be a step change and open up a new area of machine learning for materials.
B
The graph methods, I think, are moving especially fast compared to all these others, because the graph machine learning community is very large and very active right now, so we can benefit from what they're doing. This OrbNet idea is very cool; I haven't seen it applied to any materials, but I'm sure it's coming.
B
This has really limited how we generate new materials, and so I think there's a lot of opportunity there to apply the same ideas to these other, more interesting systems. Okay, finally, I just want to highlight people who helped contribute to this. Javi and Brandon (Brandon's a postdoc here at NERSC on the NESAP program) and June all helped make some of the slides for this, so thanks, Javi, Brandon, and June.
B
I also wanted to highlight four students who are applying for PhD positions this year, since this is PhD application time. Sudeesh, Sarab, Amish, and Richie are all awesome people who have been working on machine learning and materials, and so if you're interested in some of these ideas, or machine learning potentials, or whatever, feel free to shoot me an email and I'm happy to give you their info.
B
The community is pretty close-knit, and I'm really excited with how these things are going. There's been a lot of collaborative work, so hopefully this leads to more collaboration in the future. So, with whatever time I have left, I am happy to answer some more questions.
B
Okay, so the first question is: what should the representation be for a classical MD simulation that's very large, like 20,000 water molecules in a box? It's a great question. In the classical MD codes, a lot of the work goes into long-range forces like Lennard-Jones and electrostatics.
B
So if that is important, you need to be a little bit careful there. There are several different projects that I know of to try to implement neural network potentials into MD codes like LAMMPS, so for the short-range contribution, basically the bond energies and the angular interactions, I don't see any reason that those shouldn't apply. I find this very similar to reactive molecular dynamics potentials, where those have worked relatively well for long-timescale MD. I think people are trying to do it;
B
it's just very time intensive to take one of these machine learning codes and interface it with something like LAMMPS. So the idea is good. If you're interested, shoot me an email and I'm happy to point you in the direction of some people working on this. It's just a very time-intensive process, because you have to be very careful with the code.
B
The second question is on the cutoff radius, and they mentioned that the typical cutoff radii for classical MD are 12 angstroms, but we're using seven or eight, or maybe even lower, in these unit cells. So what is the impact of this, and what's the problem? I think it's a great question.
B
I think one thing you have to consider is that you have to include a lot of information in that 12 angstroms: you can get by with a simple Lennard-Jones potential as long as you know every other neighbor nearby. In this case, we have a little bit more information from these neural network potentials. It's not just a Lennard-Jones; it also has some wiggles and other things, and so there are correlations in what that function should look like and what happens nearby.
B
For a lot of the systems that people are training on, van der Waals has not been super important. For most of the catalysis examples, the DFT codes themselves don't even include van der Waals, so if I'm running RPBE calculations, I don't need a 12-angstrom cutoff radius, because the DFT doesn't even have that in there. As soon as you go to a system where van der Waals is important, things like, as you say, a large water box, or larger unit cells, or large molecules on a surface, or less covalent interactions, then you have to be more careful.
A
Alrighty, I don't see any more questions at the moment. I don't know if you need to run to another thing; just in case any more come in, but I do at least want to note that it's 11:00.
A
Thank you very much, Zach, for this fantastic presentation: a great overview of a lot of the activities going on in this space, a lot of good stuff and good references to follow up on. And with that, actually, that brings us to the end of the Deep Learning for Science 2020 program. So, just very briefly, I want to thank everyone who contributed to and attended the webinars; I think we had a lot of really fantastic speakers.
A
We do hope to be able to get back to an in-person event next year, but of course stay tuned for announcements on that one. More special thanks to Mustafa: he wasn't able to join today, but he was really the one driving the organization of this event; he did the majority of the work putting together the program, and did a great job.
A
So thanks to Mustafa, and the rest of us will still be on Slack, so feel free to continue using that workspace to discuss the material or related deep learning for science topics. We hope to see you next time.