A: He's working on a signature facility which has recently been awarded a CD-0 by the DOE. He has led the computational materials work for ExMatEx, the national labs' exascale co-design effort, and leads all of LANL's work on materials informatics. He's a fellow of the American Physical Society, a recipient of the Japan Society for the Promotion of Science award in 2010, and also a Fellow for outstanding research in science and engineering at Los Alamos National Lab. So today, I think, he's talking about data challenges and opportunities for the next generation of materials innovation.
B: Thank you very much indeed. So let me try to tell you the scope of what I will focus on. There are essentially two issues, two sets of problems, that I will tell you something about. The first one relates to innovation in experimental materials science, because in many ways that really will be the source of the data, the increasing sizes of data that we will have to wrestle with, and that will then take me to MaRIE, Matter-Radiation Interactions in Extremes.
B: This is a very revolutionary concept that Los Alamos is working on. It is a decadal challenge, in that this is an in situ facility with an XFEL, a free electron laser, designed to actually look at materials behavior. In other words, if you've got a material that's being subjected to something or other, a shock, you want to be able to see inside it to actually see what's happening. So imagine the amount of data that you're going to collect.
B: So that's the first part of my talk, and then, once you have all the nice data, clearly the challenge is to learn from the data. This will bring me to the whole aspect of materials informatics and design that we've been looking at, insofar as: how do we do statistical design? How do we do informatics? And what is it that we can learn? I'll show you that uncertainties are very, very important; they really allow us to explore the search space.
B: It's all a matter of exploitation and exploration, and that's really how one can find materials with targeted response. So I'll give you some examples. I'll focus on an alloy system, nickel-titanium; I'll show you how we can find nickel-titanium alloys with very, very small thermal dissipation by using this strategy. There are some other examples that I have, dielectrics and light-emitting diodes, but I won't have time to tell you about those. Okay, so that's really the scope of what I want to tell you about.
B
Okay,
so
the
first
question
that
arises
is:
do
we
really
have
a
big
data
problem
in
material
science?
If
you
talk
to
experimentalist,
my
friends,
tell
me,
look
I
basically
have
you
know.
Ten
data
sets
I,
have
10
phase
diagrams
of
10
samples.
I
have
synthesized
10,
solid
solutions.
This
is
not
a
big
data
problem
right,
and
so
what
you
have
to
do
is
to
start
from
there.
However,
if
we
look
at
what's
been
happening,
this
is
really
the
landscape.
As
far
as
the
data
is
concerned,
what
you
have
is
the
in
many
ways.
B: In many ways, the closest to a lot of these activities is the LHC, the Large Hadron Collider. That really is the closest, because that's also an experimental facility, and in many ways that's really the benchmark. They've got a very, very nice data portal, they've got a lot of experience, and if you talk to them, they will tell you where the challenge really is: they process a lot of data, but processing is one aspect; another aspect is collecting the data. Okay, so what they see are hundreds of millions of collisions.
B
They
use
certain
criteria,
dirty
criteria
to
essentially
look
at
one
event
in
a
thousand,
and
it's
really
from
that
one
L
event
in
a
thousand
that
they
make
certain
conclusions.
So
this
in
many
ways
provide
some
measure
of
what
the
landscape
is
like.
It's
estimated
that
in
the
year
by
2025
in
the
next
upgrade
that
they
actually
foresee
the
amount
of
data
will
increase
by
24
right
now.
For
example,
the
velocity
of
the
data
approaches
something
like
twenty
five
gigabytes
per
second.
So
that's
the
scope.
That's
really
the
big
data
problem
now
in
material
science.
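To put that quoted velocity in perspective, here is a back-of-the-envelope sketch in Python; the 25 GB/s figure is from the talk, while the assumption that it is sustained around the clock is mine.

```python
# Rough scale implied by the LHC data velocity quoted in the talk.
# Assumes the 25 GB/s rate were sustained continuously, which is an
# idealization; real beam time is far more intermittent.
rate_gb_per_s = 25
seconds_per_day = 24 * 60 * 60
daily_pb = rate_gb_per_s * seconds_per_day / 1e6  # GB -> PB
print(f"~{daily_pb:.1f} PB per day at a sustained 25 GB/s")  # ~2.2 PB/day
```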
B: Now, in materials science, here is sort of my rendition of the materials data landscape. The kind of problem that I will tell you about, insofar as being able to do informatics and design, is that little speck right there: one gigabyte, basically just a small amount of data. I'll tell you something about APS HEDM, high-energy diffraction microscopy, which is of the order of a terabyte; if you take a lot of data, several samples, it's of the order of 10 terabytes. EBSD, you're not really using a light source there.
B
It's
all
basically
done
on
site
of
the
order
of
two
gigabytes
per
sample.
Really
the
future
in
terms
of
collecting
data
lies
in
new
experimental
facilities,
light
sources.
Okay,
so
that's
really
where
the
action
is
so
80s
is
an
example
currently.
So
this
is
really
the
current
state
of
the
art,
but
then
lcls.
This
is
at
flack.
So
this
is
the
XA
fel
facility,
coherent,
x-ray
diffraction,
where
you
can
actually
get
beautiful
spatially
as
well
as
time-resolved
data.
That
really
is
sort
of
of
the
order
right
now,
five
terabytes
per
beam
time.
B
So
this
is
some
guy
going.
You've
got
be
in
time;
he
basically
takes
one
sample
essentially
can
sort
of
take
collect
data.
You
know
for
five
different
fields
or
stresses
and
I'll
give
you
an
example
of
what.
Actually
you
can
learn
by
doing
that
LCLs
too.
So
that's
the
upgrade
in
the
next
five
years,
100
terabytes
per
beam
time.
This
is
Marie,
so
that's
the
facility
I'm
going
to
tell
you
something
about
that
really
is
of
your
sort
of
tenfold
okay.
B
So
that's
in
the
next
decade,
/
beam
time,
so
this,
in
our
view,
is
sort
of
the
landscape.
Now
this
is
by
the
way,
a
logarithmic
scale.
So
that's
why
you
see
this
sort
of
slight
discrepancy
in
the
sizes
of
these
things.
So
that's
la
see
there
there's
Google
there,
so
this
gives
you
the
scope
of
where
material
science
life.
So,
yes,
we
are
going
to
be
moving
to
an
area
where
we
will
have
large
amounts
of
data,
but
the
contention
is
that
a
lot
of
this
data
will
come
from
experimental
facility
such
as
these.
B: New compounds get added, you know, every week to this kind of data set, but it's essentially a stationary data set. You've got, for example, OQMD, where you've got of the order of 300,000 to 400,000 compounds; they're all sort of part of the ICSD anyway. And so that's all well, but it's a very useful thing to do, because you can learn something from the data, and the way you learn from this data is really by screening. Okay.
B
So
here
the
emphasis
is
on
generating
data
and
then
screening
to
learn
something
as
this
thing
from
another
strategy
that
I'll
tell
you
something
about
where
I
actively
ask
the
question:
what
are
the
next
experiments?
I
need
to
be
able
to
do
to
find
a
material
with
a
targeted
property?
Okay.
So
that's
high
throughput
calculations,
but
high
throughput
measurements
are
also
very
important
and
so
there's
some
very
beautiful
work
being
done
in
this
area
by
by
ichiro
kikuchi
in
particular,
and
so
here
the
idea
is
that
you
may
have
a
parameter
space.
B
Okay,
I
bet
you
want
to
span
and
get
enough
data
on,
and
then
you
want
to
hone
in
on
the
region
that
is
of
interest.
This
is
a
very
good
way
to
screen
it's
a
first
cut
and
then
you
go
and
typically
what
what's
done
is
that
you
essentially
have
some
sputtering
guns.
You
may
have
three
species,
sputtering
guns.
You
will
have
a
thin
film
and
its
natural
to
do
that
and
its
rapid
rapid
sort
of
characterization.
So
when
11
zap
you
can
actually
get
out,
you
can
do
diffraction
get
out
the
lattice
parameters.
B
B: But then you have a big jump, a jump in that the data sets now are of the order of terabytes, tens to hundreds of terabytes, and here the whole idea, and it's a challenging problem, is that what you're doing is reconstructing microstructure. Okay, so you basically shine photons from the light source at a given layer, and layer by layer you build, you reconstruct, the microstructure. It's very, very slow.
B
If
I
look,
if
I
show
you
the
work
flow
associated
with
that
problem,
it
shows
you
that
complexity
that
you're
dealing
with
you
know
you've
got
360
angles.
This
is
only
50
layers.
You
got
three
distances,
far
field
near
field
somewhere
in
between
54,000
measurements,
okay,
obvious
diffraction
patterns.
So
that's
your
data
set
and
then,
of
course,
what
you
have
to
do
is
you
have
to
calibrate
the
model.
You've
got
to
do
some
forward.
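That 54,000 is just the product of the three factors quoted; a one-line check:

```python
# Measurement count for the HEDM reconstruction workflow quoted above.
angles = 360      # rotation angles
layers = 50       # sample layers ("this is only 50 layers")
distances = 3     # detector distances: far field, near field, in between
print(angles * layers * distances)  # 54000 diffraction patterns
```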
B: Here you've got to calibrate the model, and then you've got to be able to infer the crystal orientations that correspond to the diffraction pattern that you have. So this is a very time-consuming problem, and some of the informatics tools that I will tell you something about, we're actually using those to alleviate some of the bottlenecks in the analysis. We want to be able to show that we can actually speed this process up, rather than doing it the way it's currently being done, by brute force.
B: You should see it coming; so the wheel is moving. You've got a wheel, it's moving around at a certain rate, and you're laying down material. Okay, so there is a torch there that's laying down material, and what you've got is a light source that is going to interrogate what's happening to the material. It's all in situ. So what you're doing, this is the sort of deposition, just showing you: this is the wheel that's moving. There should have been something at the bottom that didn't come out.
B
Okay,
that's
fine!
So
so
this
is
sitting
on
a
wheel,
that's
actually
moving
and
what
you
what
this
is.
The
this
is
a
torch.
So
what
you're
doing
is
you've
got
a
diffraction
spot
here.
You've
got
another
diffraction
spot
here,
a
diffraction
spot
here.
These
are
the
diffraction
patterns
associated
with
the
would
be
with
the
process,
and
so
you
can
see
here
what's
happening.
Is
it's
essentially
molten
liquid?
B
It's
very
diffused,
as
you
go
a
little
bit
here,
you're
starting
to
see
some
of
the
peaks,
the
signatures
of
the
crystallization
process,
and
then
thereafter
you
see
well-defined
peaks,
okay.
So
it's
in
situ
monitoring
of
all
three
of
the
diffraction
patterns
from
which
you
can
infer
things
like
residual
stresses
and
so
forth.
So
this
is
a
nice
example
of
what
we
can
do
and
clearly
the
data
sets
are
fairly
large,
their
of
the
order
of
hundreds
of
terabytes-
and
this
is
just
being
done
right
now
by
my
colleague,
Don
Brown,
at
Los,
Alamos.
Ok!
B
So
now
we
come
to
what
is
what
are
the
sort
of
next
generation
facility
is
going
to
look
like,
so
we
just
saw
that
you
want
to
learn
about
processing
and
how
the
processing
effects
structure,
which
is
what
you
learn,
is
by
doing
the
diffraction.
Clearly,
the
next
step
is
going
from
the
process
to
a
more
product
based
set
up,
whereby
you
really
want
to
learn
about
the
properties
that
you're
interested
in
in
controlling,
and
so
this
is
where
Marie
comes
in.
So
this
is
Marie,
which
is
a
facility.
B
It's
of
the
order
of
two
and
a
half
billion
dollars
over
the
next
decade.
That
Los
Alamos
wants
to
build
to
be
able
to
monitor
materials
behavior
in
situ,
ok
and
it
uses
a
free
electron
laser
accessory
here.
Why?
Because
really,
it's
only
the
coherence
of
the
x-rays
which
give
you
the
brilliance
and
the
high
repetition
rate
that
will
allow
you
to
have
the
kind
of
time
resolution
that
you
need
to
be
able
to
monitor,
what's
happening
as
a
function
of
time.
We
imagine
snapshots
at
picosecond
nano
second
intervals.
You
can
control
that
over
time.
B
Ok,
so
you
can
sort
of
in
one
millisecond.
You
can
sort
of
get
a
whole
bunch
of
different
snapshots,
separated
by
say,
100
picoseconds,
that's
what
you
want
to
be
able
to
do
to
monitor.
Well,
you
know
what's
happening,
how
the
material
is
behaving,
how
it's
transforming
etc-
and
let's
say
it's
a
multi
probe.
So
you've
got
information
at
different
scales.
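For scale, the 1 ms window and the 100 ps spacing quoted above bound how many snapshot slots a single experiment could in principle have; nothing beyond those two numbers is assumed here.

```python
# Snapshot slots in one shot window at the quoted spacing.
window_s = 1e-3       # one millisecond
spacing_s = 100e-12   # 100 picoseconds between snapshots
print(f"{round(window_s / spacing_s):,} slots")  # 10,000,000
```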
B
Ok,
so
you've
got
protons,
you
got
x-rays
electrons,
and
so
you
can
sort
of
take
a
continuum
image
using
protons
and
the
x
fe
l
will
give
you
a
lot
of
fine
structure,
so
multi
probe
through
to
be
able
to
do
that.
Let
me
tell
you
now
what
we
can
actually
do
today,
so
lcls
a
track
is
an
example
of
a
free
electron
laser.
Now,
it's
very,
very
good
for
thin
films,
molecules,
etc.
B: It's about 19 nanometers and it's a ferroelectric, okay. Ferroelectrics are very important because you're dealing with polarization switching; they're very important for FeRAMs, ferroelectric RAM, and what you want is high density. Okay, so the way you get high density is by taking nanoparticles and essentially building arrays, and so the idea is to study a nanoparticle like this. Now, because of its polarization, it's a vortex. So this is an example of a vortex; it's got a dislocation line.
B
It's
got
a
core,
that's
like
a
dislocation
line
in
the
inside,
so
you're,
seeing
a
sort
of
vortex,
nanoparticle
being
image,
and
now
you
can
take
slices.
So
these
are
slices
inside
this
guy.
Ok,
and
you
can
see
that
under
the
action
of
a
field,
this
vortex
Center
actually
is
sort
of
you
know,
shifting
in
the
medium
itself.
So
this
is
beautiful,
because
I
can
exactly
see
inside
its
3d
imaging
and
you've
got
all
the
beautiful
data.
This
is
what
we
can
do
now.
This
requires
about
this.
B
Will
this
sort
of
generates
about
one
and
a
half
terabytes
of
data
per
sample?
So
so
so
this
is
the
state
of
the
art,
ok,
and
so,
where
we're
going
is
so
we've
been
working
on
Murray
at
least
I've
been
working
on
Murray
since
2008
we've
got
it
to
the
point
where
the
critical
decision
zero
has
been
has
been
approved,
and
it's
now
going
through.
You
know,
there's
a
whole
process
when
you
build
facilities
has
to
go
through
CD,
1,
CD,
2,
etc,
and
it's
going
to
be
on
the
Mesa.
It's
in
fact
really
become.
B: So that's where we're going. In terms of data, where we are now is that we basically get one VISAR plot, that's a velocity profile, when you do a shock experiment; okay, you know, you basically get one every few days. With MaRIE we will be able to get hundreds of these, okay, and that's where the large data starts to enter this game. So that's what we're heading towards, and of course the integration, this co-design loop, is very, very critical.
B: So that's what I will address, and then I will just show you how we can use very similar methods to look at data from facilities, okay, like for example APS here, right. And so, you know, the kind of data that I'm talking about here is very, very small, and so the best way to show you how we've been doing this is by an example. So here's my example: I want to find an alloy, it happens to be a nickel-titanium alloy, with the lowest hysteresis, thermal hysteresis. Okay.
B
So
now,
as
you
probably
know,
nickel
titanium
is
a
shape-memory
alloy,
so
that
simply
means
that
I
essentially
prepare
it
in
some
shape,
give
it
some
shape.
I
I
sort
of
look
at
it
in
the
martensite
phase,
low
symmetry
phase,
and
then
I
can
deform
it
to
my
heart's
content.
But
when
I
then
heat
it
up
across
the
transition,
the
structural
transition,
it
recovers
the
shape
that
was
given
to
it.
It
recovers
the
screens.
B
Okay,
the
way
you
monitor,
that
is
through
differential
scanning
calorimetry,
and
so
you
basically
look
at
the
heat
flow.
So
you
heat,
you
get
a
Heaton
and
you
cool
you
get
essentially
a
peak.
The
hysteresis
is
the
interval
between
those
two
right
so
for
a
material.
What
you!
What
you
want
is
to
minimize
that
hysteresis.
Why
do
you
want
to
minimize
that
it's
theresa's,
because
that's
what
affects
fatigue
and
you
want
something
that
can
go
through
many
many
cycles
without
fatigue?
That's
your
target!
That's
what
you
want.
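In code, the quantity being minimized is just the separation between the two DSC peaks. A minimal sketch, assuming synthetic Gaussian heat-flow curves in place of real calorimetry data (the peak positions below are made up for illustration):

```python
import numpy as np

# Temperatures scanned and synthetic heat-flow curves for the heating
# and cooling runs (illustrative Gaussians, not real DSC data).
T = np.linspace(250, 350, 1001)                     # kelvin
heat_flow_heating = np.exp(-((T - 310) / 4) ** 2)   # peak on heating
heat_flow_cooling = np.exp(-((T - 285) / 4) ** 2)   # peak on cooling

# Thermal hysteresis = interval between the two peak temperatures.
delta_T = T[np.argmax(heat_flow_heating)] - T[np.argmax(heat_flow_cooling)]
print(f"delta T = {delta_T:.1f} K")  # 25.0 K here, about the NiTi spread quoted next
```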
B
How
are
you
going
to
do
this
so
so
here,
for
example,
is
nickel,
titanium,
okay,
which
is
which
is
used
in
industry
a
lot,
and
you
can
see
that
the
spread
is
of
the
order
of
25
k,
okay
with
cycles
and
also
the
interval.
This
delta
T
is
also
of
the
order
of
25
30
k,
so
our
strategy
was
ok,
we're
looking
for
a
multi-component
alloy
with
very
very
small
sum
of
hysteresis,
and
so
our
domain
knowledge
told
us
that
we're
going
to
restrict
ourselves
to
this
family.
B: Now, the problem is very simple: I want to know what are the x, y and z which will minimize the hysteresis. That's my problem, okay, and clearly the space is very large. Our experimental friends had the ability to control the composition to 0.1 percent, so if I use that, the search space, the number of possibilities of x, y and z, is of the order of 800,000. So now what you're doing is, you know, you've got this vast search space and you want to find which particular composition is going to minimize the thermal hysteresis.
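A hedged sketch of where a number of that magnitude comes from: with composition controlled to 0.1%, each dopant fraction lives on a 0.1% grid, and you count the grid points allowed by the alloy family's constraints. Only the 0.1% step and the roughly 800,000 total are from the talk; the ranges and dopant budget below are hypothetical placeholders chosen to show the counting logic.

```python
# Count candidate compositions on a 0.1% grid, working in integer
# tenths of a percent to avoid floating-point edge cases.
count = 0
for x in range(0, 201):          # hypothetical dopant 1: 0.0 .. 20.0 %
    for y in range(0, 51):       # hypothetical dopant 2: 0.0 .. 5.0 %
        for z in range(0, 201):  # hypothetical dopant 3: 0.0 .. 20.0 %
            if x + y + z <= 200:  # hypothetical 20% total dopant budget
                count += 1
print(f"{count:,} candidate alloys")  # order 8e5, the scale quoted in the talk
```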
B
That's
the
problem,
and
how
are
you
going
to
do
it?
The
key
point
is
that
uncertainties
are
very,
very
important.
They
will
allow
us
to
in
many
ways
search
that
vast
search
space.
So
the
strategy
that
we
came
up
with
is
this.
We
essentially
starts
with
our
compositions.
We
identified
the
material
descriptors
or
the
features,
and
here
domain
knowledge
was
very
important.
We
know
if
we
knew
from
the
literature
that
things
like
the
valence
electron
number
per
atom
is
important.
It
affects
the
thermal
hysteresis.
It
affects
the
transition
temperatures,
a
lot
of
work
published
atomic.
B
The
radii
of
the
of
the
of
the
chemistry
the
species
is
also
important,
so
we've
essentially
down
selected
a
set
of
features.
We
didn't
want
the
featureless
to
be
too
large
because
that
really
explodes
the
sort
of
degree
of
difficulty
in
terms
of
adamant
high
dimensionality
of
the
problem.
You
want
it
to
be
small,
yet
you
wanted
to
it
to
be
able
to
say
something,
then
what
we
did
was
to
essentially
do
what
everybody
else
does
you
do
inference?
So
this
is
when
people
talk
about
materials
informatics,
this
is
really
what
they
mean.
B
I'm
going
to
do
some
kind
of
regression.
Okay,
you
can
use
your
favorite
tool
box
off
the
web
scikit-learn
and
you
will
do
inference.
That's
what
I
mean
there,
but
the
key
point
that
we
realized
was
that
it's
really
the
design,
that's
critical.
How
do
we
choose
the
next
experiment,
the
next
sort
of
experiment
that
has
to
be
done?
Okay,
this
is
not
a
one-shot
deal.
I
make
a
prediction
here:
it
is
good
for
you,
you
really
have
to
iterate,
so
you
make
you
make
certain
predictions.
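As a concrete, minimal version of that inference step: this is my sketch rather than the speaker's actual pipeline; the features and measured values are random placeholders, and a Gaussian process is used because the design step that follows needs predictive uncertainties.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy training set: rows are synthesized alloys, columns are down-selected
# descriptors of the kind mentioned in the talk (valence electron number
# per atom, atomic radii, ...). Values here are made up for illustration.
X_train = np.random.default_rng(0).uniform(size=(22, 3))    # 22 alloys, 3 features
y_train = np.random.default_rng(1).uniform(5, 40, size=22)  # measured delta T (K)

# A GP gives both a prediction and an uncertainty, which the design step
# below needs in order to trade off exploration against exploitation.
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X_train, y_train)

X_candidates = np.random.default_rng(2).uniform(size=(1000, 3))  # unexplored alloys
mu, sigma = gp.predict(X_candidates, return_std=True)  # mean and "error bar"
```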
B
You
suggest
certain
alloys
for
the
experimentalist
to
do,
and
we
suggested
for
so
that
in
itself
is
a
nice
informatics
problem.
What
are
the
best
for
that?
You
should
suggest,
and
then
the
experimentalist
goes
and
makes
them.
We
then
put
that
back
it
augments
the
data
set
the
training
data
and
we
keep
going.
This
is
how
we
iterated
to
actually
come
up
with
the
solution
to
this
problem.
Okay,
now
before
I
give
you
the
solution
to
this
problem.
You
know
there
I
want
to
make
this
point.
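Continuing the sketch above (it reuses gp, X_train, y_train, and X_candidates from it), this is the shape of that suggest-measure-augment loop. The nine loops and the batch of four are from the talk; the expected-improvement selector and the measure() stand-in for the experimentalist are my assumptions, since the talk doesn't name the exact selection rule used.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_y):
    # EI for minimization: rewards low predicted delta T (exploitation)
    # and large predictive uncertainty (exploration).
    z = (best_y - mu) / np.maximum(sigma, 1e-12)
    return (best_y - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def measure(x):
    # Stand-in for the experimentalist synthesizing and measuring an alloy.
    return float(np.random.default_rng(int(1e6 * x[0])).uniform(5, 40))

for loop in range(9):                       # the study ran nine loops
    mu, sigma = gp.predict(X_candidates, return_std=True)
    ei = expected_improvement(mu, sigma, y_train.min())
    picks = np.argsort(ei)[-4:]             # suggest four alloys per iteration
    new_y = np.array([measure(x) for x in X_candidates[picks]])
    X_train = np.vstack([X_train, X_candidates[picks]])  # augment training data
    y_train = np.concatenate([y_train, new_y])
    gp.fit(X_train, y_train)                # refit, and keep going
print("best delta T found:", round(y_train.min(), 1))
```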
B
Come
back
to
this
point,
so
inference
is
really
not
adequate
by
itself.
You
really
need
to
explore.
So,
let's
see
what
of
what
is
when
doing
when
one
does
influence
regression.
Ok,
you're
in
you've
got
it
you've
got
some
data
set
somebody's
given
you
and
what
your
empirically
doing
is
constructing
some
function.
F
of
X
can
be
least
square.
For
example.
You
know
you
all
done
be
squares
right,
so
so
on,
though,
on
the
left
there
I
have
a
plot
of
exactly
something
like
that
where
I'm
showing
you
for
the
22
compounds.
B
That
are
that
I
showed
you.
I
have
the
predicted
delta
T
against
the
measure
delta
T,
and
you
feel
that's
pretty
good.
I
already
have
a
nice
model
and
really
what
I'm
looking
for
is
small
delta
T.
So
what
may
occur
to
you
is
that
hey
I
have
an
outlier.
You
see
right
there,
large
uncertainty
and
anyway
I'm
looking
for
a
small
delta
T,
that's
not
so
important.
I
don't
want
to
sample
their
your.
B
You
will
be
tempted
to
basically
throw
that
away
bad
okay,
because
there's
a
large
uncertainty
associated
with
it
and
you
don't
know,
what's
going
to
happen
in
the
next
step
me
tell
you
what
actually
happens
in
the
next
step
we
sort
of
we
predicted
for
and
so
the
one
this
one
still
allows
for
a
small
delta
T.
It
still
allows
for
it,
but
I,
don't
know
what
the
result
is.
So
we
went
in
and
and
and
and
synthesized
the
compound
three
of
them
very
nicely
made
the
model,
but
the
fourth
one.
B
Essentially
we
be
predictably
measured.
One
was
quite
large
okay,
so
that
tells
you
in
subsequently
that
that's
not
going
to
be
important.
But
a
priori,
you
don't
know
that
what
this
is
telling
you
is
that
there
is
a
landscape
in
feature
space
which
has
local
minima,
and
it's
very
important
for
you
to
explore
this
to
be
able
to
get
the
sort
of
best
global
minimum.
B
If
you
can
and
you
mustn't
pro
stuff
away
because
then
you're
not
exploring
you're,
not
getting
the
best
results,
it
is
suboptimal
and
we
can
actually
show
that
because
what
we
did
was
to
give
ourselves
a
pest
problem.
I
have
a
data
set.
This
happened
to
be
max
phases
of
220
compounds
and
then
I
basically
asked
myself
and
I
know
the
elastic
moduli
I
asked
myself.
I
want
the
compound
with
the
largest
elastic
modulus.
Let's
do
it,
and
so
this
is
number
of
new
measurements
against
the
initial
number
of
measurements
in
your
training
data.
B
But
if
you
do
them
using
inference
what
usually
people
do
in
materials
informatics,
then
you
can
see
that
what
you
will
get.
So
that's
pure
exploitation.
You
will
get
something
like
this,
but
the
best
result
is
when
you
actually
do
statistical
design.
So
you
can
see
it
doesn't
matter
which
strategy
you
use
in
statistical
design.
They
all
work
reasonably
well
and
they
give
you
the
best
results
within
35
new
measurements.
I
can
get
the
best.
I
can
get
the
compound
with
the
best
modulus
okay.
B
So
that
basically
brings
me
to
this
slide,
which
really
is
not
new
industry
has
really
known
about
this
for
a
long
time
and
the
Operations
community
has
known
about
this
is
for
a
long
time.
It's
really
a
matter
of
using
uncertainties
to
sort
of
balance,
the
trade-off
between
exploration
and
exploitation.
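That trade-off is visible in the standard expected-improvement selector used in the loop sketch earlier; the talk doesn't write this formula out, so this is the textbook form for minimizing delta T, with GP mean mu(x), standard deviation sigma(x), and current best measurement y*:

```latex
\mathrm{EI}(x) =
  \underbrace{\bigl(y^* - \mu(x)\bigr)\,
      \Phi\!\Bigl(\tfrac{y^* - \mu(x)}{\sigma(x)}\Bigr)}_{\text{exploitation: low predicted } \Delta T}
  \;+\;
  \underbrace{\sigma(x)\,
      \phi\!\Bigl(\tfrac{y^* - \mu(x)}{\sigma(x)}\Bigr)}_{\text{exploration: large uncertainty}}
```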
B
This
happens
to
be
a
gaussian
process
model,
and
what
you
see
here
are
the
data
points
where
you
don't
have
uncertainty.
You
know
the
stuff,
but
where
you
don't
have
data
points,
you
have
these
footballs
of
uncertainty
and
that's
where
you
need
to
go
and
explore.
Okay,
very
important
and,
as
I
said,
a
lot
of
these
ideas
are
not
new
they've
been
used
in
the
aerospace
industry.
Also
in
the
auto
industry.
B
They
come
the
classic
ideas
that
go
back
to
Howard
and
Kushner
30
40
years
ago
on
the
value
of
information,
and
so
they
recognize
that
what's
important,
if
you
want
to,
if
you
have
complex
calculations
that
are
going
to
take
a
look
many
many
days,
you
really
need
to
choose
the
best
infill
points.
You
really
need
to
address
the
issue
of
what
are
the
best
response
surfaces.
B
So
a
lot
of
this
goes
under
the
heading
of
surrogate
based
modeling,
and
all
we
did
was
to
take
those
lessons
from
these
guys
and
actually
implement
them
on
materials
data
sets.
So
this
is
the
result
that
we
got
from
our
study.
The
alloy
that
we
found,
which
had
the
smallest
form
of
dissipation,
is
right.
There
I,
certainly
wouldn't
have
been
able
to
find
that
otherwise
and
we
got
it
in
the
sixth
iteration.
B
So
in
our
actual
exercise
we
went
through
nine
loops.
We
made
36
predictions
because
you
were
giving
for
each
time.
So
we
come.
We
synthesize
36
compounds,
14
of
them
were
better
than
the
best
in
our
training
set,
and
so
the
p-value
is
very
small.
There's
no
way
that
we
could
have
found
this
compound
on
a
random
basis.
If
you
want
to
know
how
good
it
is.
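A hedged sketch of the kind of significance check behind that claim. The 36 compounds and 14 successes are from the talk; the base rate, the chance that a randomly chosen alloy beats the best training alloy, is a made-up placeholder, since the talk doesn't state it.

```python
from scipy.stats import binom

n_synthesized = 36  # from the talk: nine loops, four suggestions each
n_better = 14       # from the talk: better than the best training alloy
p_random = 0.01     # ASSUMED base rate for a random alloy beating the best

# P(at least 14 successes out of 36) under random selection.
p_value = binom.sf(n_better - 1, n_synthesized, p_random)
print(f"p-value ~ {p_value:.1e}")  # vanishingly small under this assumption
```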
B
This
shows
you
that
the
shift
in
in
in
the
temperature
and
in
delta
T
you
over
60
cycles
is
very,
very
small
compared
to
something
like
nickel
titanium,
and
one
of
the
things
to
point
out
is
that
our
compound
is
very
competitive
in
the
in
the
landscape
of
compounds
but
notice
that
the
the
transition
temperature
is
also
in
the
right
window.
Now
we
didn't
design
for
that.
B
That
was
fortuitous,
but
it
shows
you
that
the
right
way
to
do
these
things
is
through
multi-objective,
optimization,
ok,
so
this
strategy,
we
have
subsequently
been
using
to
address
the
problem
of
that
the
facilities
care
about
the
whole
problem
of
reconstruction.
Ok,
where
I
want
to
sort
of
choose
very
very
fast.
I
want
to
get
very,
very
fast
the
orientations
of
crystal
orientations
which
will
match
the
detector
pattern,
because
that's
what
you
care
about
in
the
microstructure
I
want.
The
crystal
orientations-
and
so
we've
actually
implemented
this
and
it
works
very,
very
competitively.
B: The big warning here is that there is no free lunch, okay. Materials informatics is really fraught with a lot of issues, and there's a very famous theorem called the no-free-lunch theorem, which basically says that there is no universal optimizer. Something that I do on a given data set, a model that I come up with for a given data set, there's no assurance that that's going to work on a slightly different data set; okay, no assurance at all. So you have to be exceedingly careful.
B
There
are
no
results
to
guide
you
here
in
classification
for
binary
classification.
You
have
a
result
that
can
actually
guide
you,
but
here
there
are
no
results.
You
really
have
to
be
very,
very
careful
things
like
when
you
have
small
data
sets
things
like
cross-validation.
Don't
work
very
well.
The
bioinformatics
people
really
know
this
well,
because
they've
got
few
patients
and
they've
got.
You
know
thousands
of
genes,
a
very
large
feature
space
and
that's
fraught
with
a
lot
of
difficulties,
and
so
you
have
to
use
these
methods
very
carefully.
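A minimal illustration of that cross-validation caveat, entirely my construction: a tiny random regression problem, scored with 5-fold CV under different shuffles to show how widely the estimates scatter on small data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))  # 20 samples, like a small alloy data set
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=20)

# Re-run 5-fold CV with different shuffles: on 20 points the R^2
# estimate moves a lot from split to split.
scores = [
    cross_val_score(LinearRegression(), X, y,
                    cv=KFold(5, shuffle=True, random_state=seed)).mean()
    for seed in range(10)
]
print(f"CV R^2 spread: {min(scores):.2f} .. {max(scores):.2f}")
```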
B
But
one
thing
we
have
found
is
that
the
design,
this
exploration,
exploitation
start
strategy
really
in
some
sense
make
makes
amends
for
the
lack
of
an
adequate
inference
model.
That's
a
very
interesting
thing.
Usually
people
just
want
a
good
regression
model.
We've
discovered
that
the
designer
limit
actually
is
quite
forgiving
of
the
paucity
of
the
inference
model.
B
You
know
what
I've
talked
about
is
a
data-driven
approach,
but
I
think
that's
not
adequate.
We
really
need
to
bring
in
theory.
We
need
to
bring
in
theory
and
relationships,
constitutive
relationship
scaling
relationships
to
constrain
the
search
base
and
how
we
do
that
is
outstanding
challenge,
so
I
think
I
think
it's
data-driven,
plus
knowledge
that
should
give
better
predictions
rather
than
just
data
driven
by
itself.