Deep Learning for Science School 2019 - Lawrence Berkeley National Lab
Agenda and talk slides are available at: https://dl4sci-school.lbl.gov/agenda
Hi everyone, I'm here today to talk to you about some of the projects that I worked on during my PhD, which was focused on applying machine learning to various chemistry applications.
So, by way of a very general introduction: what are some of the problems we care about when we say we're trying to find a molecule that will solve problems in chemistry? It might look like the following: we're looking for a molecule that satisfies particular properties. For example, in drug discovery we're interested in finding a molecule that will fit exactly into this binding pocket inside of the protein, which might be your drug target.
In the case of something like flow batteries, we care about finding molecules that have the right reduction potential, so that when they're cycled thousands or tens of thousands of times over the battery's lifetime, they stay stable and keep the right reduction potential.
It's very easy to make small tweaks to a molecule: just change a functional group on the side and you get a combinatorial explosion of molecule space. So it's very difficult to search for molecules that will solve particular problems if you do it in just a linear fashion, and this is what got us motivated to think about ways of applying machine learning to chemical discovery.
Throughout this talk, in telling you about my projects, I hope I can leave you with a couple of lessons that I've learned about how to apply machine learning to chemistry problems in particular. I think it's really important to be able to frame both the input representation and the targets as something that machine learning can model. So you have to find a way to discretize your inputs and discretize your target outputs in order for them to be representable to machine learning.
Obviously, the other big constraint is finding a data set that's large enough to do what you want and to cover all of the input space that you care about. And finally, as much as possible, try to build scientific knowledge and scientific intuition into your models; it really does help the performance. Okay, so, representing non-regular inputs. Molecules can vary in their number of atoms.
They vary in the number of bonds they have, and this makes for a rather complicated representation problem compared to, say, images, where you always have the same number of pixels. So how do we handle this? The drug discovery community has been thinking about solutions to this problem for 40 or 50 years now.
One solution is to express the molecular graph as a text string, and you can see here that different fragments of the molecule correspond to different parts of the text string, and the opening and closing of rings are also represented by numbers.
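To make this concrete: the talk doesn't name the format on this slide, but these strings are almost certainly SMILES, and a toolkit like RDKit can convert between the string and the molecular graph. A minimal sketch (the molecule is my own example, not one from the slides):

```python
# Minimal sketch, assuming the text strings are SMILES, using RDKit to
# show the string <-> graph correspondence described above.
from rdkit import Chem

smiles = "c1ccccc1O"  # phenol; the paired "1"s open and close the ring
mol = Chem.MolFromSmiles(smiles)

print(mol.GetNumAtoms())      # 7 heavy atoms (6 carbons + 1 oxygen)
print(Chem.MolToSmiles(mol))  # canonical form of the same string
```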
Another solution is to work with the molecular graph directly, with information localized around the nodes of the graph representation.
If you're interested in learning more about molecular representations, I encourage you to check out this review written by my lab mate, Ben. Okay, so the first project that I'll talk about is applying variational autoencoders to chemical discovery.
So the motivation for this is that, typically in your machine learning workflow, you have input molecules, you train a machine learning algorithm to predict some properties that you care about, and once you've trained it well, you go back and iterate: you generate new molecules that you can feed through the model, and you actually test some of them in the lab once you're satisfied with the predictions that you've made for those molecules. But this can be rather slow.
The first step, really, is to be able to compress the representation of molecular space down from on the order of 10^60 possible molecules to something more manageable. The way we did this is by using a variational autoencoder. The first part of it is a convolutional neural network taking in this text representation, and we end up compressing the representation down to on the order of 100 dimensions.
Then we use a recurrent neural network to decode the latent representation and try to reproduce the same output, so this part focuses on the reconstruction. It turns out, though, that there's no reason you need to stop at only using the autoencoder to encode and decode the molecule. The latent vector is itself also a valid representation of the molecule.
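As a rough illustration of that architecture, here is a hedged PyTorch sketch: a 1D convolutional encoder over a one-hot string, a roughly 100-dimensional latent code, and a GRU decoder emitting per-character logits. All layer sizes and names are my illustrative assumptions, not the published model:

```python
# Illustrative sketch of a SMILES variational autoencoder: CNN encoder,
# ~100-dim latent code, GRU decoder. Sizes are assumptions, not the paper's.
import torch
import torch.nn as nn

VOCAB, MAX_LEN, LATENT = 35, 120, 100

class SmilesVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv1d(VOCAB, 9, kernel_size=9), nn.ReLU(),
            nn.Conv1d(9, 10, kernel_size=11), nn.ReLU(),
            nn.Flatten(),
        )
        self.mu = nn.Linear(10 * 102, LATENT)       # 102 = 120 - 8 - 10
        self.logvar = nn.Linear(10 * 102, LATENT)
        self.dec_rnn = nn.GRU(LATENT, 250, batch_first=True)
        self.dec_out = nn.Linear(250, VOCAB)

    def forward(self, x):              # x: (batch, VOCAB, MAX_LEN) one-hot
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        z_seq = z.unsqueeze(1).repeat(1, MAX_LEN, 1)  # feed z at every step
        out, _ = self.dec_rnn(z_seq)
        return self.dec_out(out), mu, logvar          # per-character logits
```

Training would minimize reconstruction cross-entropy on the logits plus the KL divergence of (mu, logvar) against a unit Gaussian, which is what makes the latent space smooth enough to optimize in.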
Now it's a learned representation of the molecule. So what we then did was use this latent representation to predict a target property that we cared about. The idea here is that, now that we have a smooth representation of molecule space, we can associate this representation with a property, and therefore we can optimize by moving along this property surface: we can optimize in the latent space and then decode to get the corresponding molecule.
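A sketch of what that optimization could look like, continuing the hypothetical `SmilesVAE` above: attach a small property head to the latent code, train jointly, and then take gradient steps on a latent point before decoding it. Again, the names and sizes are assumptions:

```python
# Hypothetical continuation of the SmilesVAE sketch: a property head on
# the latent code, plus gradient ascent on a latent point z. After the
# loop, z would be decoded back to a molecule with the decoder.
import torch
import torch.nn as nn

LATENT = 100
prop_head = nn.Sequential(nn.Linear(LATENT, 67), nn.ReLU(), nn.Linear(67, 1))

z = torch.randn(1, LATENT, requires_grad=True)   # seed point in latent space
opt = torch.optim.Adam([z], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = -prop_head(z).sum()                   # maximize predicted property
    loss.backward()
    opt.step()
```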
We did this as a toy example with molecules from the ZINC data set, which is a collection of drug-like molecules, and we subsampled it, taking 250,000 molecules. The labels, the properties that we were predicting here, are all cheminformatic descriptors, and the reason for this is that it's possible to get values for all the molecules in our data set cheaply. So this is what the latent space looks like.
Roughly speaking, when you train it with just the autoencoder (the coloring scheme I'm showing you is the value of logP, which is a rough measure of the solubility of a molecule), you can see that the organization of the latent space is very disordered. Versus if you train it with the two-part network that I showed you, the joint property-predicting autoencoder, you end up with a distribution of points such that the higher values are in one part of the latent space
and the lower values are in the other part of the latent space. And indeed we used this for optimization: we selected molecules with a low percentile of our objective value, usually in the bottom 20%, and we were able to use gradients in this latent space, with some optimization techniques, to find new molecules that had really high property values according to our model.
So I think this work demonstrates that optimization is possible when you do have a large labeled data set and when the properties that you want to predict are pretty smooth. But I think there are some obvious caveats. In particular, we struggled a lot with trying to make the generated molecules actually feasible.
In some cases the text string generated had grammar issues, or the molecule it produced, even when the string was correct, was actually not synthesizable. And there's been a lot of work recently on finding ways to ensure that the generated molecules are actually feasible, either by applying grammar rules, like using a grammar variational autoencoder to ensure that the string output is correct, or by trying to generate molecules directly, building them up from fragments or even building them up from reactions. There have been a couple of papers in this vein.
Another big issue is that this doesn't work well for small datasets yet, and there needs to be more work done on finding ways to apply the learned representation to datasets that are much smaller than 250,000 molecules, either by using a transfer learning approach or maybe by using a semi-supervised approach for cases where you can use some other property.
B
So,
in
the
cases
where,
like
in
drug
discovery,
you
might
only
have
a
couple
of
100
labeled
examples,
maybe
you
can
have
some
other
property,
that's
correlated
with
that
property
that,
where
you
can
have
many
more
data
points,
many
more
labelled
examples
and
use
that
help
your
mom
treating
your
model.
The
other
big
issue
with
this
is
that
for
properties
that
are
not
smooth,
this
becomes
a
complicated,
optimization
problem
as
well.
So
it
may
be
difficult
to
use
this
towards
optimizations
of
those
problems.
The next topic I'd like to talk about is synthesis of molecules. As I said before, in addition to these molecules being really different, they all have very different synthetic pathways, and so the problem of synthesis is a problem of trying to identify the building blocks that you need, and the reactions that you need, to eventually synthesize these molecules. An organic chemist would look at a molecule like this and see that this bond is particularly weak.
So what do I mean? How do we frame the reaction prediction problem, or the synthesis problem, in a way that a machine learning algorithm can understand? When we think about reaction prediction, we're asking: given the combination of these two molecules, which I'll call the reactants, on this side of the arrow, what kind of product will these molecules form? Meanwhile, the retrosynthesis problem is kind of the opposite problem: given a target molecule, what steps should I take to synthesize that molecule?
I'll focus mostly on reaction prediction for now. The data sets for reaction prediction are, first, the USPTO data set, which is a data set of reactions that somebody scraped from the patent literature, and, second, the Reaxys collection, which is not an open-source reaction database but has been scraped from all the published reaction literature.
Again, going back to the question of how we frame reaction prediction in a way that a machine learning algorithm can generate predictions: one way of doing this is by trying to predict a reaction template. What I mean is that, given these reactant molecules as inputs, I want to predict which template, which reaction, is most likely to occur. You can think of this roughly as being like when you play chess.
Chess-playing algorithms similarly need to predict which move is most likely to help them win the game, so this is a similar idea. In my case, for this paper, we generated a set of reaction templates and constrained the space to only consider 16 different kinds of reactions. So then the process of predicting these reactions is framed as a multiclass classification task.
We try to classify which reaction type is most likely to occur, and once you know which reaction type is most likely, you can apply that transformation rule to your reactants to generate the product. So this is one way of framing the reaction prediction question.
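In code, that framing might look something like the following sketch; the fingerprint featurization and the template-application step are stand-ins I'm assuming, not the paper's implementation:

```python
# Sketch of template-based reaction prediction as multiclass classification:
# featurize the reactants, score the 16 template classes, then apply the
# winning transformation rule. Names and sizes are illustrative.
import torch
import torch.nn as nn

N_TEMPLATES, FP_DIM = 16, 2048

model = nn.Sequential(
    nn.Linear(FP_DIM, 512), nn.ReLU(),
    nn.Linear(512, N_TEMPLATES),        # one logit per reaction template
)

reactant_fp = torch.rand(1, FP_DIM)     # stand-in for a real fingerprint
template_id = model(reactant_fp).argmax(dim=1).item()
# product = apply_template(reactants, templates[template_id])  # e.g. via RDKit
```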
Another way, which has become a lot more popular in recent years, is to predict the product directly. What I mean by this is: what would happen if you made all the possible connections between atoms, or broke all the possible connections between the atoms, and then ranked those outputs, predicting which candidate is most likely to be the product of the reactants?
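A toy sketch of that enumerate-and-rank framing, where the candidate features are random stand-ins for real featurizations of the bond-edited molecules:

```python
# Toy sketch of the "predict the product directly" framing: generate
# candidate products by making/breaking bonds, featurize each candidate,
# and let a learned ranker pick the most likely product. The features
# here are random stand-ins, not real molecule featurizations.
import torch
import torch.nn as nn

ranker = nn.Sequential(nn.Linear(2048, 256), nn.ReLU(), nn.Linear(256, 1))

candidate_feats = torch.rand(40, 2048)   # 40 hypothetical bond-edit candidates
scores = ranker(candidate_feats).squeeze(-1)
best_candidate = scores.argmax().item()  # index of the predicted product
```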
We're doing fairly well on this task: the most recent paper reported about 91 percent on the reaction prediction task. And for retrosynthesis, which is kind of the reverse problem,
we're getting about 63.3 percent of synthetic routes correctly predicted within the top ten matches. But there are also, again, caveats here, in that the data sets for reaction prediction are quite limited. In particular, the scope of the reactions included in the data sets can be somewhat limiting, and there's also a lot of noise within the data. For example, many reactions might not include alcohol or water as one of their reactants or reagents, because to that community everybody already uses alcohol or water, so it's sort of implied for them.
But when we go to train a neural network on reactions, that might be useful information that shouldn't be implied; it should be added. The other big issue is that a lot of reaction datasets don't include very many negative examples, the reactions that don't work. And this is a huge liability when you're trying to predict reactions, because if you only see positive examples and you don't see any negative examples, then you won't actually know which of these kinds of reactions will or will not work. Okay.
So what is mass spectrometry about? Well, let's say you have some molecule whose identity you don't know. One process you can use to try to identify the molecule is to ionize it, with something like an electron beam, and then accelerate it through an electromagnetic field to get a spectrum. The spectrum is basically a histogram of the ions that result from the ionization of the molecule, sorted by their mass-to-charge ratio. And then what do you do with the spectrum?
So what we're proposing is that what you'd want to do is expand the existing libraries to include new molecules. How do you expand the existing library? The idea is that, if you expand the existing libraries, you would be able to find matches in the augmented portion of the library, thereby helping to increase the coverage of the existing libraries.
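For intuition, library matching itself is essentially nearest-neighbor search over binned spectra; here is a toy sketch with plain cosine similarity (real matching scores also weight peaks by m/z, which I'm leaving out):

```python
# Toy sketch of spectral library matching: treat each spectrum as a vector
# of binned intensities and return the library entry with the highest
# cosine similarity. Real scoring also weights peaks by m/z; omitted here.
import numpy as np

def best_match(query, library):            # library: {name: 1000-bin array}
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    return max(library.items(), key=lambda kv: cos(query, kv[1]))

# Augmenting the library just means adding predicted spectra as new entries:
# library["candidate_42"] = predicted_spectrum
```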
So how do we get new spectra?
There have been efforts by the Wishart lab to predict the fragmentation behavior using machine learning. It's an interesting approach, actually: they look at the molecular graph and try to predict the fragmentation probability across each bond, and then use that to aggregate the results into a spectrum. This is a lot faster than quantum mechanics predictions, but it still takes a while for some of the larger molecules, since it needs to consider all the bonds within a molecule.
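To see why the cost grows with bond count, here is a toy RDKit sketch that just enumerates single-bond cleavages and records the resulting fragment masses. The Wishart-lab models do much more than this (they learn a probability for each fragmentation event), and note these are neutral fragment masses, whereas observed ion peaks differ by a hydrogen and a charge:

```python
# Toy illustration of why per-bond fragmentation models scale with bond
# count: enumerate single acyclic bond breaks and record fragment masses.
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("CCC(C)=O")           # butanone, mass ~72
masses = []
for bond in mol.GetBonds():
    if bond.IsInRing() or bond.GetBondType() != Chem.BondType.SINGLE:
        continue                                # only simple cleavages
    pieces = Chem.FragmentOnBonds(mol, [bond.GetIdx()], addDummies=False)
    for frag in Chem.GetMolFrags(pieces, asMols=True):
        masses.append(Descriptors.ExactMolWt(frag))
print(sorted(set(round(m) for m in masses)))    # neutral fragment masses
```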
The framing here is time: I'm mostly talking about this on the assumption that you want to generate predicted spectra for thousands of molecules within an hour. Yeah, I mean, it definitely depends on your constraints. If you have the time to predict spectra using quantum mechanics for all of the molecules in your data set, then I think that's probably still the way to go.
But if you want to generate these quickly: you can imagine that in some cases you want spectra for a million molecules, and at ten minutes per molecule that's going to take a really long time, so perhaps you would like to generate them more quickly with a rough estimate. It just depends on what your goal is. All right, so what we propose here, then, is to try to predict the spectra directly.
So instead we have an end-to-end prediction process where we take in the incoming fingerprint representation and try to predict the spectrum. In our case we frame the spectrum as a multi-dimensional regression task: we consider all the bins from one to a thousand and try to predict the intensity of each of those bins.
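A minimal sketch of that framing, with sizes that are my assumptions rather than the published model: a multilayer perceptron mapping a fixed-length fingerprint to 1,000 non-negative bin intensities:

```python
# Sketch of the end-to-end framing: an MLP mapping a molecular fingerprint
# to intensities for 1,000 m/z bins. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

N_BINS, FP_DIM = 1000, 2048

spectrum_mlp = nn.Sequential(
    nn.Linear(FP_DIM, 2000), nn.ReLU(),
    nn.Linear(2000, N_BINS), nn.ReLU(),   # intensities are non-negative
)

fp = torch.rand(1, FP_DIM)                # stand-in for a real fingerprint
predicted_spectrum = spectrum_mlp(fp)     # shape (1, 1000)
```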
I want to take a moment here to talk about the data set. The data set we used for this prediction task is the NIST library itself; there are about 250,000 spectra in it, and it also comes with a collection of 30,000 replicate spectra. These replicate spectra are interesting in that the spectra from these molecules were considered too noisy.
So I'd like to take a moment, then, to talk about the importance of accounting for physical phenomena in your models. In mass spectrometry, you get two ions: it's possible to get a small ion that corresponds to the actual fragment that's broken off. So in this case we have a methyl group breaking off, and if it gets the positive charge after the ionization event, then you have a peak at 15.
But if the opposite happens, and the positive charge ends up on the larger fragment, then you get a peak at, in this case, 58 minus 15, which is 43. Now, what happens if I consider the same event for a larger molecule? For a larger molecule whose molecular mass is 72, if the positive charge goes on the smaller fragment, then you still have a peak at 15.
But if the positive charge goes on the larger fragment, then this is the same fragmentation event, but the peak now occurs at 57, just because the molecular mass of the molecule is different. If you're interested, you can read more about the way that we applied this idea in the model, but basically we had the model account for the molecular mass: we gave the model the molecular mass in the prediction process to help it make predictions about each bin, and this had significant results.
So what I'm showing you here is the baseline: if you're using the original reference library to perform this library matching, then this is the best possible performance you could achieve. If you use a linear regression model instead of the full multi-layer perceptron I showed you before, you get about 4 percent accuracy; by including the molecular mass in the prediction, you can get a prediction about 20 points better, and this is just with the linear regression model.
If you consider both modes, both counting up from zero and counting down from the molecular mass, and let the model combine the two for the spectral prediction, you get about 10 points of improvement, and the same thing follows for the full multi-layer perceptron model. So, yeah, we were able to achieve pretty high accuracy with significantly faster runtime using this approach.
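For intuition, the two modes can be read as a forward prediction indexed up from m/z = 0 and a reverse prediction indexed down from the molecular mass M, blended per bin. This toy sketch uses a fixed blend weight where the real model would learn the combination:

```python
# Hedged sketch of combining the two prediction modes: a "forward" head
# indexed from m/z = 0 and a "reverse" head indexed back from the
# molecular mass M, blended with a fixed per-bin gate for illustration.
import numpy as np

def combine(forward, reverse, mol_mass, gate=0.5):
    out = np.zeros_like(forward)
    for i in range(len(forward)):
        j = mol_mass - i                  # reverse head counts down from M
        rev = reverse[j] if 0 <= j < len(reverse) else 0.0
        out[i] = gate * forward[i] + (1 - gate) * rev
    return out
```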
Extensions of this would be to try applying it to other, smaller classes of molecules, but again, you have to make sure that you have enough examples to represent the spectra in that case. You can also imagine using this to predict other kinds of spectra. So, yes, to go back to the lessons.
B
Obviously,
it's
also
important
to
find
a
reasonably
large
data
set
for
that
covers.
The
range
of
inputs
that
you
care
about,
the
definitions
are
reasonably
large.
Well,
do
vary
for
different
applications
and
finally,
as
demonstrated
by
the
last
project,
it's
really
important
to
apply
scientific
knowledge
and
apply
intuitions
that
you
already
know
it's.