From YouTube: Week 10 - Hidden Physics Models - Maziar Raissi
Description
More about this lecture: https://dl4sci-school.lbl.gov/maziar-raissi
Lecture slides: https://drive.google.com/file/d/1pfPs-ll_ffq7SYMZVWPISnfwTG6oLJHC/view?usp=sharing
The Deep Learning for Science School: https://dl4sci-school.lbl.gov/
And in simple terms, the hydrofoil is going to change into an airfoil, and it's going to help that vessel lift from the water and then start flying.
And you can imagine that you are doing 4D MRI: we are trying to simulate, for the MRI, basically space-and-time imaging, and it's common practice.
From that, it's going to be an inverse problem, and a very complicated inverse problem. We want to reconstruct what the flow looks like. That's the exact dynamics, because we are simulating it, so we know how the exact dynamics is going to look: these are the streamlines, and the streamlines are color-coded by the pressure. And that's what the algorithm learns.
What am I not going to be talking about today? These topics. I'm not going to be talking about image classification, where you have an image, and it's a high-dimensional object, because you're going to have a lot of pixels, and the pixels are basically your dimensions, and every pixel has three channels: red, green, blue. So that's a very high-dimensional object.
It's very expensive computationally, because of each data point that you're going to collect; for us, it was taking six hours. We had this Reynolds-averaged turbulence model, and it was really expensive.
Take that geometry and feed it into your favorite CFD solver.
That's training: we want to maximize the likelihood, and the normal distribution has an exponential term. If you write down the formulation, it has an exponential, and we want to get rid of the exponential; that's why we take a log. And in machine learning we usually like to minimize rather than maximize, so we are going to minimize the negative of the log of the likelihood.
And that's what you're going to get. That's our objective function, and there is nothing complicated about it. It's just the normal distribution: when you take the log of the exponential, these are the terms that are going to be left, plus some constants, and the constants are not going to affect your optimization, so we're going to drop them.
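As a concrete sketch (my own, not the lecture's code), this is the negative log marginal likelihood of a zero-mean Gaussian process; the squared-exponential kernel, the log-parameterization, and the small `noise_var` jitter are assumptions I'm adding:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf_kernel(X1, X2, signal_var, length_scale):
    """Squared-exponential (RBF) kernel matrix between two sets of points."""
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return signal_var * np.exp(-0.5 * d2 / length_scale**2)

def negative_log_likelihood(log_params, X, y, noise_var=1e-6):
    """NLL = 0.5 * y^T K^{-1} y + 0.5 * log|K|, constants dropped."""
    signal_var, length_scale = np.exp(log_params)  # optimize in log space
    K = rbf_kernel(X, X, signal_var, length_scale) + noise_var * np.eye(len(X))
    L, lower = cho_factor(K, lower=True)
    # 0.5 * log|K| equals the sum of the log-diagonal of the Cholesky factor.
    return 0.5 * y @ cho_solve((L, lower), y) + np.sum(np.log(np.diag(L)))
```

The exponential is gone: only the quadratic data-fit term and the log-determinant term remain, which are exactly the terms left after taking the log and dropping the constants.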
And the catch is the training step. For kriging, you usually assume that you know these parameters, these two parameters, and then everything is going to be fast and easy. But during training with Gaussian processes, you have to find the best hyperparameters, and that's going to give you the best basis function: you are trying to adapt your basis to your data, and that's the role of the training. And this step is missing from kriging.
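That training step, continuing the sketch above (this reuses `negative_log_likelihood` from the previous block; the toy data and the choice of optimizer are mine), is just a minimization over the hyperparameters:

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: noisy observations of a smooth function.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(20, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(20)

# The "training step": adapt the hyperparameters (and hence the basis) to the data.
res = minimize(negative_log_likelihood, x0=np.zeros(2), args=(X, y))
signal_var, length_scale = np.exp(res.x)
```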
Again, that's a prior. That's a choice that you have: you don't have to assume the mean is zero; you can assume something else. But even if you assume the mean is zero, you're going to get a mean after conditioning on the data, and this is going to be a non-zero mean, and you have a variance that's updated. What are the formulas for that?
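These are the standard Gaussian process regression formulas: for a zero-mean prior with kernel $k$, training inputs $X$, targets $\mathbf{y}$, noise variance $\sigma_n^2$, and a test point $x_*$,

```latex
\begin{aligned}
\bar{f}(x_*) &= k(x_*, X)\,\bigl[K(X,X) + \sigma_n^2 I\bigr]^{-1}\mathbf{y},\\
\operatorname{var}\!\bigl[f(x_*)\bigr] &= k(x_*, x_*) - k(x_*, X)\,\bigl[K(X,X) + \sigma_n^2 I\bigr]^{-1} k(X, x_*).
\end{aligned}
```

The posterior mean is non-zero even though the prior mean was zero, and the posterior variance is the prior variance reduced by what the data explains.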
And we don't see the blue curve. What are we going to do? We're going to take the mean and add 2 standard deviations to it from our prediction model. That's going to be f-bar of x, the mean, plus 2 standard deviations, and that's going to give you an upper confidence bound, this curve you see now. The blue curve you couldn't see, but the red curve you can see, and you can evaluate it. And if you can see it and evaluate it, we can maximize it.
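A minimal sketch of that acquisition rule (the helper name `gp_mean_and_std` and the grid search are my assumptions; the factor of 2 is from the talk):

```python
import numpy as np

def next_evaluation_point(x_grid, gp_mean_and_std):
    """Maximize the upper confidence bound f_bar(x) + 2*sigma(x) on a grid.

    gp_mean_and_std(x_grid) should return the posterior mean and standard
    deviation at each grid point, e.g., from the formulas above.
    """
    mean, std = gp_mean_and_std(x_grid)
    ucb = mean + 2.0 * std  # the "red curve": visible and cheap to evaluate
    return x_grid[np.argmax(ucb)]
```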
You would probably need to put a point here and do another simulation there; that's six more hours. So let's see how many hours we are saving: six hours here, another six hours here, that's 12 hours, and some more hours here, because you want to match the curve. These are the costs that we are avoiding by using the Bayesian optimization framework. And for more information, this is a very good paper to refer to.
You try to write down the correlation between the two functions now, and the model that we wrote, that's the prior assumption that we made. It might be correct, it might be wrong, but once it sees the data, it's going to fix it. And that's a very simple model: you say our high-fidelity model is a linear combination of our low-fidelity models (this could be a vector; we can have multiple models), plus a noise model.
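In symbols, this is the classic autoregressive multi-fidelity prior (the notation is mine: $\rho$ is the linear-combination coefficient and $\delta$ the discrepancy/noise term, each modeled as a Gaussian process):

```latex
f_{\mathrm{high}}(x) = \rho\, f_{\mathrm{low}}(x) + \delta(x),
\qquad
f_{\mathrm{low}} \sim \mathcal{GP}\bigl(0, k_{\mathrm{low}}\bigr),
\quad
\delta \sim \mathcal{GP}\bigl(0, k_{\delta}\bigr).
```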
So whatever I'm going to tell you started from this simple observation: the derivative of a Gaussian process is another Gaussian process. So the derivative of the Gaussian process is a Gaussian process, of course with a different kernel; the derivatives are going to go on the kernels. But that's a crucial, a crucial observation.
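Concretely, if $u \sim \mathcal{GP}(0, k(x, x'))$, the derivatives land on the kernel:

```latex
\operatorname{cov}\bigl(u'(x),\, u(x')\bigr) = \frac{\partial k(x, x')}{\partial x},
\qquad
\operatorname{cov}\bigl(u'(x),\, u'(x')\bigr) = \frac{\partial^2 k(x, x')}{\partial x\, \partial x'}.
```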
So now that this is very similar to multi-fidelity, it means that you can use it to do cool stuff with differential equations. For instance, you can try to solve them; that's the first thing that comes to mind. Let's try to solve our differential equation, but now you can solve it in a data-driven fashion.
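For a linear operator this works exactly like the derivative rule above; in my notation, placing a GP prior on the solution induces a GP on the forcing, and the cross-covariance ties the two sets of observations together:

```latex
u \sim \mathcal{GP}(0, k),
\quad f = \mathcal{L}_x u
\;\;\Longrightarrow\;\;
f \sim \mathcal{GP}\bigl(0,\, \mathcal{L}_x \mathcal{L}_{x'} k\bigr),
\qquad
\operatorname{cov}\bigl(u(x),\, f(x')\bigr) = \mathcal{L}_{x'} k(x, x').
```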
These are our observations; we start with observations, and we also have some observations on the boundary, and we use those two sets of observations. You have a lot of observations on yesterday (that's like your low fidelity), and you have very few observations on your high fidelity, and we know that the low fidelity is going to help the high fidelity. We're going to use that: we're going to condition on this data.
First, do your training, do your conditioning, and once you do your conditioning, you can write down your posterior, and that's going to give you the prediction (getting a little bit ahead of ourselves) basically one time step in the future. Perfect. So now, what did we do? We took one time step, and now we have a problem.
So it's noiseless. There is no noise; you are generating it artificially, but the points are uncertain, and that's the uncertainty. You are not sure, because your data point could be here or could be here; it has a distribution. But there is this tiny difference between being noisy and being uncertain: for uncertainty, you have a distribution; for being noisy, things are deterministic, but there is a gap between the observation and the actual truth, while the location is deterministic.
Hey, sorry, I just want to give you a quick time check. Just so you're aware, there's about 27 minutes left.
Low-dimensional systems. Actually, it could be high dimensional, but it's finite dimensional; the dimension is finite. It's an ODE, an ordinary differential equation. What we have here is for infinite-dimensional stuff: it's for functions, for function spaces; it's for partial differential equations.
So it's not like you take a Gaussian process, push it through a nonlinear differential equation, and get a non-Gaussian distribution out. No, the uncertainty is being propagated outside of your system, outside of your differential equation, and that's how the uncertainty is propagated. This uncertainty is the uncertainty of our artificially generated data points.
So there is a great question. It says: what if the assumption that u_n conditioned on u_{n-1} is normal is violated? That's common with the Kalman filter; we could use the unscented Kalman filter. What happens with GPs? You can do the same thing here.
Because if you want to measure the pressure, you are interfering with the dynamics, and the velocity is going to change. And we were trying to estimate these two parameters, basically the Reynolds number, which has something to do with the shape of the object in front of you, etc. And these are pretty good estimates, even in the presence of noise.
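A minimal sketch of that kind of parameter estimation, in the spirit of physics-informed neural networks but not the paper's code: a toy problem u_t = λ u_xx, where the physical parameter λ (standing in for something like 1/Re) is trained jointly with the network on noisy measurements of u. All names and hyperparameters here are my choices.

```python
import torch

# Network approximating u(t, x); lambda_ is the unknown physical parameter,
# learned jointly with the network weights from the data.
net = torch.nn.Sequential(
    torch.nn.Linear(2, 50), torch.nn.Tanh(),
    torch.nn.Linear(50, 50), torch.nn.Tanh(),
    torch.nn.Linear(50, 1),
)
lambda_ = torch.nn.Parameter(torch.tensor(0.0))
opt = torch.optim.Adam(list(net.parameters()) + [lambda_], lr=1e-3)

def pde_residual(t, x):
    """Residual of u_t - lambda * u_xx, computed with automatic differentiation."""
    u = net(torch.cat([t, x], dim=1))
    ones = torch.ones_like(u)
    u_t = torch.autograd.grad(u, t, ones, create_graph=True)[0]
    u_x = torch.autograd.grad(u, x, ones, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, torch.ones_like(u_x), create_graph=True)[0]
    return u_t - lambda_ * u_xx

def training_step(t_data, x_data, u_data):
    """One optimizer step on the combined loss: data misfit + PDE residual."""
    t = t_data.clone().requires_grad_(True)
    x = x_data.clone().requires_grad_(True)
    loss = ((net(torch.cat([t, x], dim=1)) - u_data) ** 2).mean() \
           + (pde_residual(t, x) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

The key design choice is that the PDE is enforced as a penalty in the loss, so the unknown parameter gets a gradient through the residual term and is fitted alongside the network.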
If you have a big data set, many of those data points are going to be redundant, and if that's the case, you can reduce the number of data points through our framework and then do a Gaussian process on the reduced data. It's similar to dimensionality-reduction ideas. Sometimes, yes, your problem is high dimensional, like images, but there is an underlying low-dimensional manifold in a high-dimensional space. Similarly, if you have a very huge data set, maybe most of it is redundant. If that assumption is correct, you can reduce the dimension.
Using noisy data, initial and boundary, and that's nice: you got a solution. I'm now going to a question from the audience in the Q&A. The question is: you can solve differential equations using other methods, so why would you use Gaussian processes, and why would you use neural networks or Gaussian processes? For Gaussian processes, the answer is clear: you get the uncertainty; you get a nice, visually speaking, quantitative as well as qualitative uncertainty bound around your solutions. As for neural networks:
Neural networks are going to shine in high dimensions, because they are grid-less. In 1D, yes, you can put 10 grid points and solve your differential equation using finite differencing. In 2D, you need to put 10 to the power 2 grid points; in 3D, it's going to be 10 to the power 3; in 4D, it's going to be 10 to the power 4; and in 100D, forget about it, forget about grids. So in high dimensions, neural networks shine.
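The arithmetic behind that, with the talk's 10 points per axis:

```python
# Grid points needed with 10 points per axis: 10**d explodes with dimension d.
for d in (1, 2, 3, 4, 100):
    print(f"{d}D: 10^{d} = {10**d:.3g} grid points")
```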
We can collect data behind the cylinder, on the velocity only. And then, to be honest, when we were writing that paper, we were trying to find the Reynolds number, this number and this number here, but then something interesting happened: the pressure popped out of nowhere. Well, it's popping not out of nowhere; it's popping out of our equations, out of enforcing these equations in our loss function. But we had absolutely no data on it.
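For reference, the equations being enforced in that loss are the 2D incompressible Navier-Stokes equations (written here in a standard nondimensional form, which is my choice of presentation); the pressure $p$ appears in the momentum equations even though no pressure data is observed, which is why it can be recovered:

```latex
\begin{aligned}
u_t + u u_x + v u_y &= -p_x + \tfrac{1}{Re}\,(u_{xx} + u_{yy}),\\
v_t + u v_x + v v_y &= -p_y + \tfrac{1}{Re}\,(v_{xx} + v_{yy}),\\
u_x + v_y &= 0.
\end{aligned}
```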
Thank you very much. Yeah, perfect timing. I guess if you still have time, you can stick around to answer some of the other questions on the call. If anybody needs to drop off, though, thank you very much for coming, and we'll see you for the next one.
The cool thing about hidden physics models is that you are not going to go through your differential equation, because the differential equation is there inside your prior; it's inside your loss. In the end: how would you suggest tackling PDEs when certain components of them are small compared to others?
Can you elaborate a bit more on how PINNs differ from running a neural network on simulated data? That's exactly why I spent so much time on the first part of the talk: to tell you that there are two ways to go about physics-informing things. One is: you do your simulation, get some data, and then apply and deploy a deep learning framework, a usual deep learning framework, on your data.
Okay, so it looks like you answered all the questions, Maziar. Thank you so much for going through all of those. I guess we can close this session now. Thanks for sticking around to answer those again, and thanks to everybody for coming and asking all these great questions. That was a great presentation; I think folks learned a lot, once again.