Description
Featuring Balaji Lakshminarayanan, Dustin Tran, and Jasper Snoek from Google Brain.
More about this lecture: https://dl4sci-school.lbl.gov/uncertainty-and-out-of-distribution-robustness-in-deep-learning
Deep Learning for Science School: https://dl4sci-school.lbl.gov/agenda
A: Thanks, Mustafa, and thanks everyone for the invitation and for coming to our talk. Can you all hear me? Okay, great. So today we're going to split the talk roughly into three parts. The way we've structured it is about 20 to 22 minutes each, and we'll have five to eight minutes for questions, or you can even take a break if you want. I'll present first, and if you have questions you can ask them in the Slack channel or post them in the Zoom chat, whichever is convenient, and in about 20 minutes I can take some of those questions.

All right. Today we're going to be talking about uncertainty in deep learning, and this is joint work with lots of awesome colleagues at Google, DeepMind, and elsewhere.
I'll first start off by giving some background on why uncertainty is important and why we care about this topic. Before we get to the methodology, what do we mean by predictive uncertainty? To make sure we're on the same page: what we mean is that we want to predict output distributions rather than point estimates.

Imagine you're doing a classification problem like the 2D example above, where you have some features x1 and x2 and two classes shown in red and blue, or maybe you're doing regression, where you have some feature x and you're trying to predict some function value y. Typically most people just train a classifier that gives a deterministic prediction, and here we don't want deterministic predictions.

We want a distribution that captures the uncertainty. For classification, that could mean you don't just predict which class it is; you predict the probability with which you believe the point belongs to the class. For regression, you don't just want to produce a point estimate but also a sense of the variance, as shown in the figure on the right.
So what are the sources of uncertainty? There are two sources of uncertainty that I'm going to be talking about. The first source is inherent noise in p(y|x) itself. You can imagine there's some noise in the labeling process. For instance, here is an example from the CIFAR-10H dataset, where they took images from the CIFAR-10 test set and asked multiple raters to label them, and here's the resulting distribution.

As you can see, some images are pretty clear and the human labels are essentially unanimous: on the top, everybody agrees it's a plane; on the second, everybody agrees it's a cat. But the last two images are inherently ambiguous. You could imagine the third being a ship or some kind of animal, and for the last one, since we can't see the head clearly, depending on how you look at it the texture looks a lot like a deer, but it could also be a bird. So there can be some human ambiguity in the label, which means there's an inherent distribution over labels for a given input rather than a single deterministic label. Similarly, you can have measurement noise in y.
Imagine you're running some experiments and trying to regress some function with respect to some parameters. If there is inherent randomness in the procedure, or some measurement noise, then every time you do the measurement you may observe a slightly different value, so there isn't a deterministic y for a given x but an entire distribution. This is sometimes also called aleatoric uncertainty, a term you may see. The distinguishing property between the two types of uncertainty I'm going to be talking about is that this source of uncertainty is usually considered irreducible, in the sense that even in the limit of infinite data it does not go away.

However, a lot of it can be caused by partial observability, in the sense that if you are given additional features you can reduce it. Imagine you had a higher-resolution version of this image, or something like that, where you could see more clearly; then you could reduce this uncertainty.
The next source of uncertainty is what we call model uncertainty, and this basically arises because, given limited training data, you may have multiple functions that are consistent with the observed training data. An example of this is the figure on the right, where you have two classes shown as squares and triangles and we're trying to do a binary classification problem. As you can see, there are multiple possible classifiers that explain the data equally well.

Given just this limited data, we cannot precisely pin down the one classifier that separates the data; there could be multiple, equally valid explanations that separate the two classes, so we are not sure which is the right function yet. If you observe more data, the classifier becomes more constrained, and in the infinite-data limit, assuming your models are identifiable — by which we mean that when you specify the problem there is a unique optimum, without symmetries or anything like that, so that you can actually identify each model — this uncertainty reduces as you get more data. This is also known as epistemic uncertainty, and it is considered reducible, unlike the data uncertainty from the previous slide, which persists even in the infinite-data limit.
Before I jump into the methods, I think it's useful to discuss how we measure the quality of uncertainty. There are a couple of measures that are commonly used, and one of the terms you will hear a lot is calibration. An intuitive way to think about calibration is that it measures how well, when models predict distributions, their expressed confidence — which is basically the model's own estimate of its probability of correctness — aligns with reality.

So that's how we can think about confidence: the model's own estimate of the probability of correctness. We can check how well it aligns with reality, in the sense that we can measure the accuracy on a dataset and see, in the cases where the model expects to be correct, what fraction of the time it actually is correct. A common way to do this is the following (I hope you can see my mouse pointer).
What people do, say in binary classification, where you're expressing probabilities in the zero-to-one range, is to bin those probabilities into multiple bins, and then for each bin do the following. In this bin, for instance, the model expects to be correct about 90 percent of the time — in the 0.9-to-1 range the model expects to be correct, on average, maybe 0.95 of the time — so you can use the average predicted confidence for the bin.

Then you can take all of the examples that ended up in this bin and measure what fraction of those the model got correct. If it is well calibrated, the model should be correct about 90 percent of the time on these points, and that's what you plot here: the x-axis shows the bin — the average probability, or average model confidence, within the bin — and the y-axis shows the actual accuracy of the model in that bin.
Conversely, in a bin where the model expects to be correct only 10 percent of the time, the average accuracy should be quite low, because the model is expressing very high uncertainty, whereas in a bin where the model expresses very high confidence, you expect it to be more accurate. If it's well calibrated, the predicted confidence should match up with the actual accuracy. So you can think of calibration as a sort of meta-accuracy: it measures the alignment between the model's own confidence and its accuracy.

There are various aggregate metrics for how far the model is from the ideal calibration curve, but one common one is the following: you take each bucket, measure the difference between the predicted confidence and the actual accuracy in that bucket, and then sum it up over buckets. That is what's called the expected calibration error.
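As a rough illustration of the binning procedure just described (my own sketch, not from the slides; the number of bins is an arbitrary choice):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - confidence| over confidence bins.

    confidences: array of predicted max-class probabilities in [0, 1]
    correct:     boolean array, True where the predicted class was right
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    n = len(confidences)
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.sum() == 0:
            continue
        acc = correct[in_bin].mean()       # actual accuracy in the bin
        conf = confidences[in_bin].mean()  # average predicted confidence in the bin
        ece += (in_bin.sum() / n) * abs(acc - conf)
    return ece
```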
It's important to note that calibration alone is not sufficient, because what we really care about is for the model to be as accurate as possible and also as calibrated as possible; we want both of these criteria simultaneously. A simple way a model can cheat its way to perfect calibration: imagine that on the test set all classes are equally frequent — if you take MNIST or CIFAR, all classes are balanced — so the model can trivially output the uniform distribution for all inputs. If you do that, then all of the samples end up in this one bucket here, the 0.1 bucket, because the model always predicts 0.1 confidence, and just because of the statistics of the data it can be perfectly calibrated. But it wouldn't be accurate at all, because it's just a random predictor.
So we also care about something called refinement, and about accuracy, and that's why, when we look at the calibration of models, it's also important to look at the accuracy of the underlying models. There are a bunch of other metrics that are commonly used, and a lot of these metrics were actually invented in the weather-forecasting literature, because they rely on probabilistic forecasters and it's very important to assess the properties of those forecasters.

This is a very nice reference if you want to learn more about proper scoring rules and evaluation measures for probabilistic forecasts, and I'll just briefly mention two commonly used measures. One is the negative log likelihood, which basically takes the logarithm of the probability the model assigns to the true outcome; it is a proper scoring rule. One implication is that it can sometimes over-emphasize tail probabilities, so it can be a bit sensitive to outliers.
That's something to be aware of. The other popular metric that a lot of people use is the Brier score, which is a quadratic penalty, as you can see here. Imagine you're doing a classification problem and you have a probabilistic forecast: you measure the mean squared error against your one-hot target and then take the average, and that's called the Brier score. A nice property is that it is bounded: the error on any single data point is bounded in the zero-to-one range, whereas the log loss can have quite a broad range, and this boundedness can be useful. The Brier score also has a couple of other nice properties.
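As an illustrative sketch (my own, not from the slides), here is how the two scores could be computed for a multi-class classifier; variable names are arbitrary:

```python
import numpy as np

def negative_log_likelihood(probs, labels):
    """Mean NLL: negative log of the probability assigned to the true class."""
    eps = 1e-12  # avoid log(0)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))

def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and one-hot targets."""
    one_hot = np.eye(probs.shape[1])[labels]
    return np.mean(np.sum((probs - one_hot) ** 2, axis=1))
```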
In particular, it turns out you can decompose the Brier score into calibration and refinement, which I mentioned on the previous slide. I'm not getting into the details, but you can find more in the paper cited above. The other property we care about for uncertainty is the behavior on out-of-distribution inputs, and an example of that is this figure here.
Imagine you are training a classifier on CIFAR-10, which is a popular benchmark dataset containing images of planes, birds, cats, and so on. Intuitively, if you ask this model about something that is not one of the existing classes — for instance the example on the right, which shows images from the Street View House Numbers dataset — then a classifier trained just on CIFAR-10 should say "I don't know", or, if it makes a prediction, it should be with very low confidence, because the input is not one of the existing classes. Humans are great at this.
Imagine you only speak one language: if I'm shown characters from a different language that I don't even understand, I can still say I don't know what it is, but I'm pretty sure it's not an English character, or something like that.

Humans are great at this, and we want our models to have this property too: they should be able to predict "none of the above", or say when examples don't belong to any of the existing classes. Some ways to measure this are to take the model confidence on the in-distribution (IID) inputs and the model confidence on what are called out-of-distribution (OOD) inputs, like SVHN above, and look at summary statistics — for instance, how separable the model's confidences are: the maximum confidence the model assigns, or the entropy of p(y|x), for in-distribution versus out-of-distribution inputs. You can also compute summary metrics of these statistics; for instance, you can measure the AUC, which measures how separable these curves are, and so on.
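A minimal sketch of that kind of evaluation (illustrative only, and assuming scikit-learn is available): score each input by its maximum softmax probability and measure how well that score separates in-distribution from OOD examples.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ood_auroc(probs_in, probs_ood):
    """AUROC for separating in-distribution from OOD inputs via max confidence."""
    conf_in = probs_in.max(axis=1)    # confidence on in-distribution inputs
    conf_ood = probs_ood.max(axis=1)  # confidence on OOD inputs
    scores = np.concatenate([conf_in, conf_ood])
    # label 1 = in-distribution; a good model gives those inputs higher confidence
    labels = np.concatenate([np.ones_like(conf_in), np.zeros_like(conf_ood)])
    return roc_auc_score(labels, scores)
```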
I've talked a lot about introducing uncertainty; let me spend a few minutes on some motivating applications, because it's very important to ground the research we're doing in why it's useful in the larger context of science. There are a lot of applications of predictive uncertainty.

One theme you will see over and over in this talk is that it's important to know when to trust model predictions, and especially under dataset shift this is going to be a big problem. Uncertainty is also useful for decision making, and I'll show some applications of that, and it's important for active learning, where you use the uncertainty to get more data in regions where you don't have a lot of data.
One use case is what's called natural distribution shift. We typically assume the test set comes from the same distribution as training, but that assumption is violated a lot in real-life applications. This is an example with Street View images: just natural variations. You don't have to do anything special to induce this kind of shift; it naturally emerges in a lot of data. You can imagine that over time the way storefronts look changes, so if you train a model on data that is ten years old or so, there may be a natural shift in how the images look. Similarly across countries: imagine you have training data from one country and you want to deploy the model on data from a different country; there will be natural variations in the data, so you want models to be robust to this type of natural distribution shift.
Another example, which I mentioned before, is open-set recognition, which is the case where the test input may not actually belong to one of the existing classes. A really nice example of this is the paper shown here, where the task is to predict bacterial species from genomic sequences. When you train the classifier at some point in time, you take all the bacterial species known at that point and train on them. But people have been discovering a lot of new bacterial classes — that's the blue line here — so there is still a lot we don't know. We can only ever train a classifier on the classes known at a given point in time, and when we deploy this classifier, it can achieve high accuracy if the test input belongs to one of the known classes. But since new bacteria keep being discovered, a lot of inputs will not belong to any of the existing classes. If the model encounters such an input at test time, we want it to reliably say that it does not belong to one of the existing classes and not wrongly classify it as in-distribution, because that type of misclassification can have a huge impact.

This is the setting called open-set recognition, where you don't know all the classes at training time and you really need a classifier that can reject such inputs in a reliable way. Another similar example is conversational dialogue systems; look at the example on the right.
Imagine you have a chatbot that can only answer questions about, say, your finances, and you ask it a different question, something about sports. Because it can only answer questions in its domain, if it responds with something unrelated, that can lead to a very frustrating user experience — we've all been there. A much more graceful way to fail is to say something like, "Sorry, I can only answer questions about this domain, and what you're asking is out of scope."

Another application where uncertainty is very important is medical imaging. There have been lots of papers on using uncertainty from deep learning models here; I have some images from these papers — check them out. One is related to diabetic retinopathy detection, another to eye disease classification from OCT scans.
Here the model is a multi-class classifier that predicts how severe the condition is, or which cases need to be seen by a doctor, and so on. In these cases it's very important to have uncertainty, because you have asymmetric loss functions, and you may want to take the model's uncertainty into account when deciding when to trust the model; if the model is unsure, we shouldn't just pass along its predictions but defer to a human, or something like that. There can also be out-of-distribution inputs: if the image is not taken properly, or is blurred, or maybe not centered, then the model should be able to reliably reject it so that the image can perhaps be retaken. So there are a lot of interesting use cases there. Uncertainty also comes up a lot in applications like Bayesian optimization and experimental design.
On the right I have an image showing Bayesian optimization in action. The way we use uncertainty is to decide the trade-off between so-called exploration and exploitation. In experimental design and Bayesian optimization we want to find the best set of hyperparameters, and we want to minimize the number of experiments, so we want to intelligently pick the next point and do an efficient search. If you have already evaluated one point, the model can assess the unobserved target at nearby points much more reliably, whereas at points that are far away from existing evaluations the model is more uncertain.

So here you care about making an accurate prediction of the target, but you also need the uncertainty, because you can use the uncertainty, via acquisition functions, to make better decisions about the exploration-exploitation trade-off and do a more efficient search.
I've talked a lot about applications, and I wanted to point to some concrete examples that highlight where current deep learning models fail, so you can appreciate that this is indeed a problem that comes up a lot in the context of deep learning methods.

For a lot of research we want benchmark datasets where we can do large-scale evaluations and compare different methods head to head, to understand which are the more promising methods and so on. One benchmark that has become quite popular in the field is ImageNet-C, which contains corrupted versions of ImageNet.
Basically, in typical benchmarks we use an IID test set from the same distribution as training, and we know deep learning methods can do very well there; as I mentioned, it's one of the success stories of deep learning. But what happens if you violate the train-test assumption? How gracefully do these models fail, and how robust are they? The way to measure this is to take some simple types of corruption — Gaussian noise, blurring, contrast changes, and so on — and then increase the intensity. Each of these operations can be applied with increasing strength (you can increase the amount of blurring you do), so that you get increasing dataset shift and have a knob to move progressively from IID to more and more out of distribution.
Then you can measure how different methods fail: intuitively we expect the model to be most accurate on the clean set and less accurate as the shift grows, and we can benchmark how the models do.

We ran a benchmark like this over a lot of methods in a previous paper, evaluating methods on the clean set and also measuring how the model accuracies drop as you increase the shift, and as expected you do see the model accuracy dropping as the shift increases. But the important thing from the perspective of uncertainty is that it's okay for the models to be wrong,
as long as their calibration — their own estimate of confidence — actually reflects this accurately. So we can measure the calibration; here is a plot of the expected calibration error of different methods. What you can see is that, unfortunately, the calibration of the models also becomes quite a bit worse as the shift increases. In essence, the models are becoming more wrong, but their confidence does not really reflect that: even when they are making mistakes, they are pretty confident. This is a big problem in deep learning, and I'll go into the details of the methods later, but this is just to showcase that it is still an unsolved problem.

The other problem I mentioned is that deep nets can assign high-confidence predictions to OOD inputs.
On the left are images from a classic paper which showed that, even when you show these images to a state-of-the-art deep network, it will assign very high confidence — more than 99% — to them, even though they are completely unrecognizable; you'd be surprised that it assigns such high-confidence predictions. You can also construct simple 2D examples, like the binary classification problem I'm showing here, where there are two classes shown in blue and orange and we have OOD inputs. If you look at the class boundaries, the models exhibit some uncertainty near the boundary, but there are still pockets of inputs, relatively far away from the training data, where the model is very confident. You would expect the model to be more uncertain there.
C: Okay — yeah, Jasper, would you like to... I can relay some of these questions to you. I think Jasper has been answering them. Yes, yeah.
C: Probably also the question about where you were adding noise to the images — I think a few slides earlier. This one, yeah.
A: Oh, I see. Okay, I think this is probably shot noise. Yeah, it's not Gaussian, but the noise also looks somewhat similar.
Oh, I see — somebody asked how you define accuracy in a regression task. That's a good point; I'm mostly focusing on classification as the running example in this talk, but for regression you can also define it. If you have a likelihood — imagine the model predicts a mean and a variance, or something like that — then you can evaluate the log likelihood under this predictive distribution, and you can have measures like mean squared error or mean absolute error and so on, which are measures of accuracy: how accurately the model recovers the underlying function. I mentioned the Brier score; it turns out you can also define similar measures on CDFs. Once you convert to CDFs you can basically reuse a lot of these measures, because they are all functions on zero to one, and there is in particular a very nice measure called the continuous ranked probability score, which measures the difference from the actual CDF.
A lot of the focus right now is on that kind of question. I think it's important to focus on the problem, but I also feel that if the model does not even pass the simple average-case checks you would expect, like CIFAR-10 versus SVHN, it's going to be really hard to solve the adversarial worst-case problem. So I think it's important to make progress on these other benchmarks as well before we start focusing on the worst case, in some sense.
B: Well, yeah — and I guess I'll take questions after my part, and I'm sure there'll be a little bit of time after the whole sequence of talks so that we can asynchronously answer questions. So I'll start by going into the language and frameworks that we're using to describe potential solutions to these problems.
Here — the next slide. This starts from taking a probabilistic approach to understanding how we solve machine learning and statistics-style problems. The very high-level overview of how this works is that, if you imagine some formalization of the scientific process, the scientific method: you start with some domain knowledge, you formalize it, you start making assumptions, and you bake them into an actual model — something that describes a generative process, or just some function that takes inputs and returns outputs. You have data; that data comes from actually running experiments if you're in the 1700s, or you have modern Amazon Mechanical Turk or something to gather the data, or you just scrape data from the web — anything can really count as data. Then, given those two inputs, you run an algorithm to infer the hidden structure that is behind your problem.

Usually that's a set of parameters that governs the family of distributions you have in your model. After running that algorithm you are finally able to make predictions and explore what your model can do on arbitrary inputs and outputs. This high-level overview is really important, because it's the foundation behind many fields that leverage data analysis, and this pipeline.
You can go to the next slide. This process is not really just a serial three-step process; it's actually a loop, in that, just like the scientific method, it's all about taking your models and really checking whether they actually fit the assumptions you care about. The formal name for this is model criticism, but more broadly in machine learning this is just how you evaluate methods: you have a leaderboard or something of that sort, you check something like accuracy, you see how well the model fits the data on unseen data — you test it — and then you go back and revise your model. You check specific assumptions, like maybe the way I'm composing conv layers is not really that indicative of anything.

Next slide, please. So that was the high-level overview, and where probabilistic machine learning particularly fits in is how it involves the language of probability theory in the individual steps of that process. So really you can think of a particular model, under the probabilistic approach, as a joint distribution.
Here, this is all going to be in discriminative land, because we're talking mostly about supervised learning, so we have a distribution over observed outputs y given inputs x, and there are going to be some parameters. The probabilistic model is a joint distribution over those outputs and those parameters; that's the model, and that's step one. Step two is about making inferences about the unknowns — in this case the parameters — and the posterior, what we call the posterior, is the conditional distribution of the parameters given the observed dataset. This is not a philosophical statement or anything; this is just Bayes' rule that I'm applying here.

It is a conditional distribution: you expand it out and you get the joint divided by the marginal distribution, and you can play around with distributions all you like. One of the central difficulties in probabilistic machine learning is that for most interesting models this denominator, the marginal likelihood, is not tractable, because it's a high-dimensional integration problem where theta, the parameters, is probably millions to billions of dimensions. So that's how we're thinking about the modeling step (step one) and the inference step (step two). Can you go to the next one?
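Written out in symbols (my notation, following the description in the talk), the posterior and the intractable marginal likelihood he mentions are:

```latex
p(\theta \mid x, y) = \frac{p(y \mid x, \theta)\, p(\theta)}{p(y \mid x)},
\qquad
p(y \mid x) = \int p(y \mid x, \theta)\, p(\theta)\, d\theta .
```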
So that's really the recipe for the probabilistic approach to machine learning, and how you might try to get at something like uncertainty estimation and robustness. Here you're specifying the likelihood — typically it's something like a neural net that ends in an output distribution, so maybe a categorical likelihood, or a Gaussian if you have continuous outputs, or something more complicated than those two — and you have a prior distribution over those parameters. Given that, step two is choosing some algorithm to actually perform approximate inference, and we'll go into more depth on how you select these things. Then step three is how you actually make predictions or do exploration. There are multiple approaches; the most generic one is to do a Monte Carlo estimate, sampling from the distribution.
Here what we're looking at is the distribution of the outputs given an unseen input — an arbitrary input x — conditional on your dataset. What we're doing is sampling from the posterior over theta and doing a Monte Carlo estimate: an average, over each sampled parameter set, of the distribution of the outputs conditional on that set of parameters. It's pretty easy to work out how this formula is derived, but that's the general recipe. Next slide, please.
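In symbols (my notation), the posterior predictive he describes, and its Monte Carlo approximation, are:

```latex
p(y^{\ast} \mid x^{\ast}, \mathcal{D})
  = \int p(y^{\ast} \mid x^{\ast}, \theta)\, p(\theta \mid \mathcal{D})\, d\theta
  \;\approx\; \frac{1}{S} \sum_{s=1}^{S} p(y^{\ast} \mid x^{\ast}, \theta^{(s)}),
  \qquad \theta^{(s)} \sim p(\theta \mid \mathcal{D}).
```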
To connect this back to what people are familiar with — just how neural networks actually fit into this — you can think of a standard neural network as taking a point estimate, a specific set of parameters, to approximate that full posterior distribution. It's a very simple approach, it works well, and it's a simple baseline for a lot of the things we're talking about. The way you might want to select that point is to choose the highest-probability point under the posterior distribution.
You start from trying to take the maximum of your posterior p(theta | x, y). Equivalently, you maximize the log posterior, because log is monotonic and so preserves the mode. You can expand out the posterior: this is the log of the joint distribution, plus a constant which is the marginal likelihood; for reasons that may not be obvious from the derivation, that term is a constant with respect to theta, because the marginal likelihood does not depend on the parameters you're changing.

You can rewrite the maximization, and now you have the generic softmax cross-entropy problem with some prior. As a special case of this, if you're doing classification — so your likelihood is categorical — and, say, you have some prior on your parameters, it could be something like L2: that corresponds to taking a Gaussian prior with a particular standard deviation, given by lambda here. So that's exactly the procedure: the special case is softmax cross-entropy plus L2, and then you need a specific algorithm to actually find the mode, or the minimum.
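The chain of equalities he walks through, written out (my notation; the Gaussian prior with scale set by lambda gives the L2 term):

```latex
\hat{\theta}_{\text{MAP}}
  = \arg\max_{\theta} \log p(\theta \mid x, y)
  = \arg\max_{\theta} \big[ \log p(y \mid x, \theta) + \log p(\theta) \big]
  = \arg\min_{\theta} \Big[ \underbrace{-\textstyle\sum_{n} \log p(y_n \mid x_n, \theta)}_{\text{softmax cross-entropy}}
    + \lambda \lVert \theta \rVert_2^2 \Big].
```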
That algorithm is something like SGD. In the figure here there's a generic distribution; it's multimodal — in this case it has four different modes — and what SGD will do is try to find one of these modes and then use that to make predictions.
Next slide. The most natural extension, coming from typical neural nets, is to think about what happens if we use non-degenerate priors and non-degenerate procedures — ones that actually put probability mass beyond a single point. The two ingredients here are: first, think about what that prior p(theta) is. There are many different choices for how you might choose p(theta), as shown by the figure on the right. There are a lot of different behaviors you might want to consider, from sparsity — how peaked the distribution is at zero — to the tail behavior, which is how much probability mass sits at the ends of the spectrum, say around three or higher and negative three or lower. And then, after specifying the prior,
you have a family of distributions you're going to use to approximate the true posterior, and we'll go into how you select that. After you fit it — after you find the specific member of that family that matches the posterior well — you might have something like the bottom figure here: if you have a sufficient amount of data on this interval, from about negative 2 to 1.5, you'll fit the data pretty well, but on unobserved inputs you've never seen — out-of-distribution inputs, anything beyond that interval — your uncertainty grows, with some behavior, linear or exponential or something. The predictive standard deviation grows as you try to extrapolate, which is the desired behavior you want for uncertainty estimation.
The first approach to doing this inference is variational inference. Variational inference is a fairly simple procedure: it takes the inference problem and casts it as an optimization problem. You have a family of distributions — a parametric family — and a common choice is something like a mean-field, or fully factorized, distribution. So here there is q(theta), parameterized by a set of parameters lambda, and we factorize it so that there's a q(theta_i) for each theta_i — each weight element, or each element of the bias terms — and you typically might choose each of these variational distributions to be something like a Gaussian, similar in form to the prior.
The loss function takes the expectation of your log-likelihood term with respect to your approximate posterior q(theta), plus a KL term. Algorithmically, how this works is that you might do something like sampling from q to Monte Carlo estimate that expectation — similar to how you might Monte Carlo estimate the posterior predictive when you're doing test-time predictions — and then you take gradients: you just backprop through it with SGD, and most of the time it will work, with a few footnotes.

How might you interpret what this loss function is doing? You can think of the negative of this loss as a lower bound on the marginal likelihood; this is known as the evidence lower bound. It's the likelihood-style interpretation under which VI was actually first invented. It comes from the EM algorithm, where you take the marginal likelihood, which you want to maximize — so you're doing something like maximum likelihood — and you derive a bound on it using an approximate posterior, and then you try to best fit that approximate posterior, getting a tighter and tighter bound on the true likelihood. This bound holds as a less-than-or-equal for all parameters lambda, and if the variational posterior were exact — equal to the true posterior — then this would be an equality rather than a strict inequality.
There's also a code-length view of this, coming from the minimum description length perspective, or the coding-theory perspective: if you look at the first term here, what you're trying to do is minimize the number of bits you're using — the flexibility you have is in choosing lambda — so you're minimizing the number of bits required to explain the data. Every time you evaluate this log-likelihood term you're paying a certain number of bits required to reconstruct the outputs given the inputs; but as a trade-off you don't want to pay too many bits, because the second term is the counterweight: if you move lambda too much, you might be deviating too much from the prior, so the KL penalty — the penalty you pay for deviating from the prior — becomes too large. Those are the two trade-offs you want to balance from this perspective.
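To make the recipe concrete, here is a heavily simplified sketch of my own (not from the talk) of mean-field variational inference for a single linear layer in PyTorch: a factorized Gaussian q over the weights, a standard normal prior, and the negative ELBO (expected negative log likelihood plus the KL term) estimated with one Monte Carlo sample and trained with SGD. All names and sizes are arbitrary.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

class MeanFieldLinear(torch.nn.Module):
    """Linear layer with a fully factorized Gaussian posterior over its weights."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.loc = torch.nn.Parameter(0.01 * torch.randn(d_out, d_in))
        self.log_scale = torch.nn.Parameter(torch.full((d_out, d_in), -3.0))

    def forward(self, x):
        q = Normal(self.loc, self.log_scale.exp())
        w = q.rsample()  # one reparameterized Monte Carlo sample of the weights
        prior = Normal(torch.zeros_like(self.loc), torch.ones_like(self.loc))
        self.kl = kl_divergence(q, prior).sum()  # KL(q || prior)
        return x @ w.t()

layer = MeanFieldLinear(d_in=10, d_out=3)
opt = torch.optim.SGD(layer.parameters(), lr=1e-2)
x, y = torch.randn(32, 10), torch.randint(0, 3, (32,))
for _ in range(100):
    logits = layer(x)
    nll = F.cross_entropy(logits, y, reduction="sum")
    loss = nll + layer.kl  # negative ELBO: data-fit bits + deviation-from-prior bits
    opt.zero_grad(); loss.backward(); opt.step()
```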
Okay, next slide, please. The first thing, in terms of what's known about how you select the prior: there are a lot of details in how we do this, and it is in fact still an open challenge, but the standard choice is a normal prior with zero mean and unit standard deviation. That's the default everyone uses, but it's not necessarily the best prior to use, and there are many reasons why we don't really want to use standard normal priors in practice.

We can look at this from the perspective of what the model looks like and what draws generated from the model look like; it ranges from how we might leverage information within the network structure, to what the asymptotic behavior looks like and whether that's the behavior we'd like, to how we actually select the prior.
If we have specific domain knowledge — for instance, if we want to encourage exploration — how might we encode such a thing? And you can go into actual trainability properties: if you were to actually use this prior with SGD, what behavior does that cause, what inductive biases does it place on SGD? This ranges from the parameterization — certain priors are not invariant to reparameterization, so if you just change the way you parameterize your neural net, that can lead to very different end behaviors when you optimize and get your final solutions — to the fact that the standard normal is also simply too strong a regularizer, particularly during the early stages of training. If you've ever tried training Bayesian neural nets in practice, what often happens is that the magnitude of the gradient signal for the KL penalty is much higher than what's required to fit the data, i.e. the expected log-likelihood term.

What then happens is that you collapse: the majority of your approximate posterior distribution just ends up equal to the prior, so you won't really be fitting the data at all. There's a lot of recent work improving how we select the prior, coming from thinking about priors in function space — thinking about exactly how inputs and outputs should behave, with a distribution over this non-parametric space.
This is a question that has been long-standing; it's been a question since we first started fitting probabilistic neural network models with backprop in the late 80s. As I was mentioning, the most common choice coming from the late 80s was in fact this mean-field, fully factorized approach, where you might have a Gaussian per element of your weight matrix, and so on. But there are many different choices: you can do mixtures of mean-field distributions, you can do structured factorizations, and there are also hierarchical versions of these things. So there are a lot of different strategies in the literature, and many individual publications are all about choosing a better approximate posterior family that works well with VI. Next slide, please.
As an alternative to approximate inference with VI, you can also do something with Markov chain Monte Carlo, where, instead of having a parametric family of distributions q that you optimize to find the member that best represents the posterior, you do a non-parametric thing: more of a guided random walk, exploring within the true posterior and collecting samples as you explore that space.

If you look at the test-time behavior, when you're making predictions you're just using a bunch of different samples — S different samples of theta — and MCMC is a way of drawing those samples. The way it works is by using what's called the energy.
As in the previous slide, this is just the negative log joint density: the first term is the likelihood, so it sums over each data point, and then you have your prior.
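Written out (my notation), that energy is the negative log joint the sampler explores:

```latex
U(\theta) = -\sum_{n=1}^{N} \log p(y_n \mid x_n, \theta) - \log p(\theta),
\qquad p(\theta \mid \mathcal{D}) \propto e^{-U(\theta)} .
```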
MCMC is really many different strategies for how you might carefully walk, leveraging this energy function, to better explore the full posterior and give you those samples. Next slide, please. MCMC for neural nets is also a very classic thing.
I think it became the standard for how you would do Bayesian neural nets, and probabilistic modeling with neural nets, in the 90s; it was in fact winning a lot of high-profile competitions back in the day. A lot of the ideas come classically from statistical physics: they leverage Hamiltonian dynamics or Langevin dynamics to give you ways to take a specific sample and move through the space while preserving certain properties that you care about for MCMC, and modern methods are leveraging a lot of these ideas.

It used to be the case that MCMC for deep learning really didn't work, because MCMC wasn't as amenable as VI to stochastic optimization, but in recent literature — within the past year or so — there have been a lot of interesting works that have gotten pretty good results with MCMC.

As a caveat, though, MCMC is not the perfect solution. There are many tricks required to get it to work, and there's a lot of impracticality in how you use these things, because the procedure tends to be fairly expensive compared to SGD.
The next slide will be about simpler baselines — a step back from these fairly complex strategies for picking the different pieces of the recipe — but before I hand that off to Jasper, maybe I can take questions and start answering those.
D: Dustin, one of the questions was: can you please provide links or pointers for learning about this topic? That came up when you were talking about priors and posteriors.
B: That's a great question. I don't think there is one canonical link or canonical paper for selecting priors. I think the Neal 1994 work — when we share the PDF for this you can get the actual reference from it; if you search for Radford Neal, "Priors for infinite networks", or something like that — goes into a bit more depth on the asymptotic perspective: what happens if you take a neural net and go to the infinite-width limit.

You can also look at the first paper combining neural nets, backprop, and VI in this recent literature, which would be Blundell 2015; it's an ICML paper that's pretty good at describing some of this. And there's a slew of more recent literature, so once the PDF is available you can check out a lot of the more recent papers that study priors.
D: Also, some early work by David MacKay is really great, kind of setting the standard for thinking about these things. — Yeah, I agree.
B: Jasper, would it make sense to use a prior that resembles the distribution we're trying to model, and in that case, how can we get the underlying distribution?
D: That's a really great question — it's something I think about quite a bit. Yes, if you know what the structure of the problem is — maybe the form of the function you're trying to regress — then ideally you would specify a prior that takes advantage of that. With deep neural nets that's harder, particularly with structured inputs. We do certainly capture something like that with data augmentation, for example, where effectively we're imposing a prior by saying our model should output essentially the same thing for slight rotations of an image, or something like that. But in neural nets it's really hard to specify a prior on the form of the function; it's kind of implicit in the architecture, the initialization, and lots of stuff around it. I'm going to talk about Gaussian processes in a second, and there...
B
I
think
the
the
best
way
to
describe
this
is
that
without
going
too
much
into
sort
of
the
equations
of
it,
you're
leveraging
chemical
dynamics
to
sort
of
preserve
some
properties
of
the
specific
sample
that
you're
you're
you're
using
there's,
there's
particular
ways
of
of
leveraging
something
known
as
sort
of
like
the
the
leapfrog
integrator
to
actually
propose
the
next
step
that
you're
doing
there's
a
lot
of
sort
of
complications.
B
How
like
leveraging
sdds
to
sample
to
get
new
points
works
with
this
sort
of
from
by
transitioning.
From
that
point,
there's
a
lot
of
like,
for
example,
there's
like
discretization
behavior.
That's
that
there's
sort
of
discretization
that
you
have
to
do
and
that
sort
of
causes
inaccuracies.
B
So
the
ultimate
procedure
that
you
might
do
with
leveraging
these
sort
of
different
equations
is
to
use
it
within
a
metropolis
hastings
procedure
where
you're
actually
proposing
and
then
you're
you're
checking,
if
that,
if
that
falls
under
a
particular
ratio
and
you're
accepting
you're,
rejecting
that
sample
and
you're
coming
and
so
on
and
so
forth,
I
think
there
could
be
an
entire
lecture
on
just
mcmc.
The next question is: how do you pick your prior if your data has multiple different distributions in it? I guess it depends on what you mean by multiple different distributions. If you have a dataset, there is in some sense one distribution that it has, but it might have something like multimodal behavior, or you might be thinking about extrapolation behavior, where the training set you're using is very different from your test set.

If you have a better sense of whether the distribution is multimodal, or whether the distribution you're making predictions on is actually within the training distribution but concentrates more on a certain area, those things are better encapsulated through functional priors. I think the best answer for how to think about functional priors is something Jasper will also touch on when he goes into Gaussian processes.
Thoughts on UQ methods based on gradient perturbation — I'm not sure which particular method you're referring to. There are adversarial methods that use gradient perturbations to be robust to certain examples; those are evaluated on adversarial examples, which are a form of OOD. They don't work as well on the other kinds of shift: they might solve the worst-case behavior when you're looking at an epsilon ball around the inputs you're testing, but if you evaluate them on standard leaderboards — on a corrupted version of CIFAR or ImageNet, or a more natural version where you just rerun the original data-gathering process and get a new dataset — they don't work super well. But I think they're super exciting in the case where we do care about worst-case behavior.
D: Stochastic gradient Langevin dynamics, or stochastic gradient MCMC, can be thought of as gradient perturbation: you just do stochastic gradient descent, but you add a little bit of noise to the gradients at every step, and that has the effect of performing this random-walk trajectory. So that is one way to get a sample from the posterior through perturbing the gradients.
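A minimal sketch of that update (my own illustration): an SGD step on the negative log joint, plus Gaussian noise whose variance matches the step size, as in stochastic gradient Langevin dynamics.

```python
import torch

def sgld_step(params, neg_log_joint, lr=1e-4):
    """One stochastic gradient Langevin dynamics update.

    params:        list of tensors with requires_grad=True
    neg_log_joint: scalar estimate of -log p(y|x, theta) - log p(theta)
                   (minibatch likelihood term rescaled to the full dataset)
    """
    grads = torch.autograd.grad(neg_log_joint, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            noise = torch.randn_like(p) * (lr ** 0.5)  # injected Gaussian noise
            p.add_(-0.5 * lr * g + noise)              # gradient step + noise
```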
So we'll start out with recalibration. A really simple idea for getting better uncertainty, you might imagine, is just to explicitly recalibrate the model, and one way you can do this is something called temperature scaling: you take a held-out validation set, or an out-of-distribution set, and you rescale the temperature of the output distribution. You basically either smooth out the output probabilities or sharpen them to better match that validation set. That's temperature scaling, and you can see it's just optimizing the temperature.
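A small sketch of that idea (my own, assuming a PyTorch classifier): fit a single scalar temperature on held-out logits by minimizing the NLL, then divide logits by it at prediction time.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, steps=200, lr=0.01):
    """Learn one temperature T on a held-out set (logits are pre-softmax)."""
    log_t = torch.zeros(1, requires_grad=True)  # T = exp(log_t), starts at 1
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        opt.zero_grad(); loss.backward(); opt.step()
    return log_t.exp().item()

# At prediction time: calibrated_probs = softmax(test_logits / T)
```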
Okay. Another strategy that's been very popular: Yarin Gal and Zoubin Ghahramani had a paper where they proposed effectively approximating an ensemble through dropout. If you're familiar with dropout, it's a regularization technique where you stochastically drop out hidden units during training, and the innovation here is to keep these stochastic units at test time and average over the predictions. So you basically drop out units when you're predicting and then average over the predictions, and that gives you at least better uncertainty than not doing anything, and it's also a pretty competitive baseline. Next slide.
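A minimal sketch of Monte Carlo dropout at prediction time (illustrative; assumes a PyTorch model containing `nn.Dropout` layers): keep dropout active and average softmax outputs over several stochastic forward passes.

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model, x, n_samples=20):
    """Average predictions over stochastic forward passes with dropout left on."""
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()  # keep dropout stochastic while the rest stays in eval mode
    probs = torch.stack([
        torch.softmax(model(x), dim=-1) for _ in range(n_samples)
    ])
    return probs.mean(dim=0)  # predictive mean; probs.var(0) gives a spread
```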
Okay, so the next thing is deep ensembles. Here the idea is basically: just rerun standard stochastic gradient training, or whatever optimization method you like, with different random seeds, and then average the predictions of the different models you end up with. These turn out to be diverse, so we get a nice diverse set of predictions. Balaji, along with his co-authors, tried this out — they tried a whole bunch of things and were surprised at how effective ensembles were — and there's a fantastic paper discussing why this is the case, the "Simple and Scalable Predictive Uncertainty Estimation" paper. Next slide.
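The recipe itself is tiny; here is a rough sketch of my own (PyTorch-style), where `make_model` and `train_fn` are hypothetical placeholders for your architecture constructor and training loop:

```python
import torch

def train_deep_ensemble(make_model, train_fn, n_members=5):
    """Train the same architecture from different random seeds (illustrative)."""
    members = []
    for seed in range(n_members):
        torch.manual_seed(seed)  # different init (and data order) per member
        members.append(train_fn(make_model()))
    return members

@torch.no_grad()
def ensemble_predict(members, x):
    # Average member softmax outputs; disagreement between members is the
    # model (epistemic) uncertainty signal.
    return torch.stack([torch.softmax(m(x), -1) for m in members]).mean(0)
```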
Balaji showed you this picture earlier, talking about how accuracy degrades; but if you look at calibration — these are the corruptions on ImageNet, looking at calibration — it also gets worse. That's from a benchmark paper we ran a year or so ago, and there the shining thing that did really well — or maybe I should say, didn't do quite as badly — in terms of calibration was ensembles.
So why do these work well in practice? Oh — Balaji... I would ask you to imagine a beautiful animation here where it says "space of solutions", but maybe I'll just describe what it is. You might ask why deep ensembles seem to do better than these very rigorous things like variational inference or Markov chain Monte Carlo that Dustin talked about, and one of the reasons is that these approximate Bayesian methods in general tend to start in a random place, then find a mode, and then locally explore that mode.
For our purposes, where we want to serve a giant model at very high throughput, you might imagine that's undesirable: you need to do a forward pass through each model, and typically we've found that we try to get the biggest neural nets we possibly can to fit into memory, and that makes it impossible to carry around multiple copies.
Oh yeah — and Bayesian neural nets also seem very promising, but with MCMC, for example, you need to carry around many samples as well, so you also have to carry around all these different models. So Balaji and Dustin and I have spent a considerable amount of effort thinking about how we can approximate these approaches in cheaper ways, along with other fantastic researchers in the community, such as Andrew Gordon Wilson, Yarin Gal, and more. Next slide, please.
please,
okay,
so
one
of
the
things
that
that
actually
dustin
and
collaborators
ethan
when
ed
al
came
up
with
was
what,
if
we
take
a
standard
neural
network
and
we
say
that
each
layer
has
a
rank,
one
factor
that
gets
multiple
multiplied
by
another
rank,
one
factor
that
then
forms
the
size
of
the
weight
matrix
and
then
multiplicatively
multiplied
by
the
weight
matrix.
D
And
so
you
can
imagine
that
this
is
kind
of
modulating
the
weights
of
the
ons
of
the
single
neural
net,
such
that.
If
you
have
multiple
of
these
rank
one
factors,
then
you
can
have
multiple
effectively
different
paths
through
the
network
from
the
bottom
to
the
top
and
then
averaging
over
a
bunch
of
these
random
vectors
multiplied
by
the
network
kind
of
gives
you
an
implicit
ensemble
that
you
can
then
compute
in
a
really
efficient
way,
just
using
batching
next
slide.
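A rough sketch of that kind of layer (my own illustration of the idea, not the paper's code): a shared weight matrix modulated elementwise by per-member rank-one factors s r^T, so different members reuse the same weights and can be evaluated in one batched pass.

```python
import torch

class RankOneEnsembleLinear(torch.nn.Module):
    """Shared weight W modulated by per-member rank-one factors (illustrative)."""
    def __init__(self, d_in, d_out, n_members):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(d_out, d_in) * 0.01)
        self.r = torch.nn.Parameter(torch.ones(n_members, d_in))   # input factors
        self.s = torch.nn.Parameter(torch.ones(n_members, d_out))  # output factors

    def forward(self, x, member):
        # Equivalent to using the member-specific weight W * (s r^T),
        # computed without materializing a separate weight matrix per member.
        h = (x * self.r[member]) @ self.weight.t()
        return h * self.s[member]
```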
D
So
we
took
this
this
idea
further
and
said:
well,
can
we
come
up
with
a
bayesian
interpretation
of
this
and
and
actually
placed
a
a
variational
posterior
on
these
rank
one
factors,
so
we
place
a
distribution
on
the
rank,
one
factors,
and
then
we
optimize
them
via
the
elbow
that
dustin
told
you
about,
and
there
then
the
product
is
you
have
this
model
with
a
distribution
over
rank,
one
factors
and
you
can
sample
rank
one
factors
that
then
modulate
each
layer
of
the
network
and
kind
of
modulate
the
path
through
which
the
data
goes
from
the
bottom
to
the
top
and
averaging
over
a
bunch
of
these
samples,
implicit
ensemble.
D
One way to get even closer to ensembles is to say you have a mixture distribution over these rank-one factors, so you sample a rank-one factor from a mixture of rank-one factors, and you could imagine each mixture component kind of corresponding to an ensemble member. And we found empirically that this performs really well and actually gives better calibration than even a standard ensemble on a bunch of problems, while incurring only a tiny addition in terms of extra parameters. Next slide.
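As a rough sketch of the sampling side of this idea only (training via the ELBO is omitted, and all names, shapes, and scales here are made up for illustration rather than taken from the actual rank-1 method): put a mixture of Gaussians over the rank-one factors, sample a component and a factor per forward pass, and Monte Carlo average the outputs.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, K = 8, 4, 3

W = rng.normal(size=(d_in, d_out))                 # shared (deterministic) weights
# Mixture of K Gaussian posteriors over the rank-one factors, one component per "member".
mu_r, sigma_r = rng.normal(size=(K, d_in)), 0.1 * np.ones((K, d_in))
mu_s, sigma_s = rng.normal(size=(K, d_out)), 0.1 * np.ones((K, d_out))

def sample_forward(x, n_samples=10):
    """Monte Carlo predictive: sample rank-one factors, modulate W, average outputs."""
    outs = []
    for _ in range(n_samples):
        k = rng.integers(K)                                   # pick a mixture component
        r = mu_r[k] + sigma_r[k] * rng.normal(size=d_in)      # reparameterized sample of r
        s = mu_s[k] + sigma_s[k] * rng.normal(size=d_out)     # reparameterized sample of s
        outs.append(((x * r) @ W) * s)
    return np.mean(outs, axis=0)

x = rng.normal(size=(5, d_in))
y_mean = sample_forward(x)
```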
D
So
there
there
is
a
particular
instance
in
which
we
can
actually
compute
the
marginal
likelihood.
So
this
integral
that
dustin
told
you
about
as
well,
analytically
and
not
have
to
approximate
it,
and
that
arises
if
we
assume
that
there's
a
gaussian
distribution
on
the
likelihood
a
gaussian
distribution
on
the
prior,
then
multiplying
two
gaussians
gives
a
gaussian
and
an
interval
over
a
gaussian
gives
another
gaussian.
So
we
can
do
everything
in
closed
form
and
compute.
D
So what you end up with is a flexible distribution over functions, this giant Gaussian. It's specified now by a covariance function over examples. If you're familiar with the kernel trick, then that's exactly what's happening here: the kernel effectively becomes the covariance matrix of this big Gaussian, and the parameters disappear entirely because we've integrated them out. And then, if we condition on data, we get a nice posterior over functions. So on the right there you can see samples from a Gaussian process prior, so sampled functions, and then conditioned on data.
D
That was interesting, Google Assistant just answered a question that I didn't ask, for some reason. Okay, so they are specified by a mean function and a covariance. The covariance is that kernel that I told you about, and we can compute effectively everything that we want analytically. So the predictive mean and covariance given observations is this equation here, this mu, which unfortunately involves inverting this kernel, this covariance matrix, and then the variance of predictions is below, which also involves the covariance between test examples and the training set and the covariance between all training examples.
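For reference, these are the standard GP regression predictive equations being described here, assuming a zero mean function and Gaussian observation noise with variance $\sigma^2$:

$$
\mu_* = K_{*X}\left(K_{XX} + \sigma^2 I\right)^{-1} \mathbf{y}, \qquad
\Sigma_* = K_{**} - K_{*X}\left(K_{XX} + \sigma^2 I\right)^{-1} K_{X*},
$$

where $K_{XX}$ is the covariance between all training examples, $K_{*X}$ the covariance between test and training examples, and $K_{**}$ the covariance between test examples.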
D
So you might look at this and say: oh, I have to compute a covariance matrix over my training data, which is n squared in size, and then I have to invert it, which is cubic in the number of operations.
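A minimal numpy sketch of exact GP regression, just to make the cost concrete (the kernel and data here are arbitrary): building K is the O(n^2) memory step and factorizing it is the O(n^3) step.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential covariance between the rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_predict(X, y, X_star, noise=1e-2):
    """Exact GP regression posterior; the Cholesky factorization is the O(n^3) step."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))   # n x n covariance of the training set
    K_s = rbf_kernel(X_star, X)                     # covariance between test and train
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s @ alpha                              # predictive mean
    v = np.linalg.solve(L, K_s.T)
    var = rbf_kernel(X_star, X_star) - v.T @ v      # predictive covariance
    return mean, np.diag(var)

X = np.linspace(-3, 3, 25)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.default_rng(0).normal(size=25)
mean, var = gp_predict(X, y, np.linspace(-4, 4, 100)[:, None])
```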
D
So
gps
are
typically
only
used
in
very,
very
low
data
regimes,
they're
kind
of
seen
as
the
the
state
of
the
art
or
the
gold
standard
and
getting
good
uncertainty
estimates
especially
for
regression,
but
because
of
the
scaling
issue,
they're
they're
kind
of
limited
to
smaller
problems.
D
Intuitively
there
are
prior
for
smooth
functions,
similar
outputs
should
have
similar
inputs
should
have
similar
outputs
and,
and
we
can
compute
all
the
quantities
that
we
want
easily.
Analytically
next
slide.
D
Okay, so you might wonder: why are you telling me about Gaussian processes? This is a deep neural net lecture. And the reason is that in the limit of infinite width, if you assume a Gaussian prior, then integrating over your parameters, which gives you good uncertainty and is kind of the Bayesian thing to do, converges to a Gaussian process.
D
So this is a seminal result that again came from Radford Neal's PhD thesis, which is kind of this amazing tome of literature that he produced. You can think of it as saying that the covariance matrix, the kernel, is basically just a covariance taken over the hidden-layer activations: you just take the inner product of the hidden-layer activations, and that gives you the distance, or the similarity, between two examples.
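Just to illustrate that statement, here is a Monte Carlo sketch rather than the analytic infinite-width kernel: sample random hidden layers from a Gaussian prior and average the inner products of the resulting activations; as the width and number of sampled networks grow, this estimate approaches the corresponding NNGP kernel.

```python
import numpy as np

def empirical_nngp_kernel(X, width=2000, n_nets=50, seed=0):
    """Monte Carlo estimate of the infinite-width kernel:
    K(x, x') ~ E_w[ phi(x; w) . phi(x'; w) ], averaged over random one-hidden-layer nets."""
    rng = np.random.default_rng(seed)
    K = np.zeros((len(X), len(X)))
    for _ in range(n_nets):
        W = rng.normal(size=(X.shape[1], width)) / np.sqrt(X.shape[1])  # scaled Gaussian prior
        H = np.maximum(X @ W, 0.0)                                      # ReLU hidden activations
        K += H @ H.T / width                                            # inner product of activations
    return K / n_nets

X = np.random.default_rng(1).normal(size=(10, 3))
K = empirical_nngp_kernel(X)
```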
D
Chris Williams came up with an actual covariance function for a particular kind of neural net in '97, and then very recently there's been a lot of renewed interest from the deep neural net literature. There's a couple of citations here, but there's been a bunch of fantastic work establishing the connection between GPs and deep neural networks in the infinite limit and coming up with the GP equivalent of interesting architectures like convolutional networks and so on.
C
We have six minutes left, but I think it's fine if we go a little bit over time.
D
Okay, so essentially one way to get good calibration is to come up with implicit priors or inductive biases that capture the kind of out-of-distribution data that you expect you would see, and data augmentation is a good way to do that. That's something we've been exploring, and it tends to help calibration significantly. Next slide.
D
Another thing we're really interested in is trying to come up with a drop-in replacement for standard neural networks. So instead of having to follow this complex machinery or carry around an expensive model, wouldn't it be nice if you could just take your existing model, augment it in some way, and then have good uncertainty? One way of doing that is to slice off the top and stick a Gaussian process on, and there are certainly some complexities to how to make that work well, but that's something we're really excited about, and it seems to work quite well in practice. Next slide.
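As a rough illustration of that "slice off the top and put a GP on the features" idea (not the actual method being referred to here, which relies on scalable approximations): train a network, take its penultimate-layer activations as features, and fit a GP classifier on those features. The `penultimate_features` helper below is hypothetical glue code for scikit-learn's MLPClassifier.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)

base = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=1000, random_state=0)
base.fit(X, y)

def penultimate_features(net, X):
    """Run X through all hidden layers of a fitted MLPClassifier (ReLU activations)."""
    H = X
    for W, b in zip(net.coefs_[:-1], net.intercepts_[:-1]):
        H = np.maximum(H @ W + b, 0.0)
    return H

Z = penultimate_features(base, X)
gp_head = GaussianProcessClassifier(kernel=RBF(length_scale=1.0))
gp_head.fit(Z, y)                      # replaces the network's final linear layer
probs = gp_head.predict_proba(Z)       # predictive probabilities from the GP head
```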
And then, ensembles: you could imagine going a lot further than just doing random initializations.
D
And the answer is yes. Within our team at Google we are open sourcing as much as we possibly can, so there's some great code to specify models and run them in Edward2.
D
Then there's a code base that we just open sourced, called Uncertainty Baselines, which effectively contains a lot of pre-made models to run across a bunch of benchmarks, including that ImageNet-C benchmark we talked about and a whole bunch of others.
D
If you want to try a new model and run it across a bunch of uncertainty benchmarks, you can do that using that code base. And then there's Uncertainty Metrics, another code library that we just open sourced, which contains canonical implementations of things like the Brier score and ECE, so that everyone can share the same implementation of metrics. Next slide.
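These metrics are standard; as a reference point only (this is the textbook definition, not the library's implementation), a minimal numpy version of ECE and the Brier score looks like this:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Standard ECE: bin predictions by confidence and average |accuracy - confidence|,
    weighted by the fraction of examples falling in each bin."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(accuracies[mask].mean() - confidences[mask].mean())
    return ece

def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and one-hot labels."""
    onehot = np.eye(probs.shape[1])[labels]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))
```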
All right, and then we'll finish off with some open challenges, things that we're thinking about. So, one is the following.
D
Florian Wenzel, Sebastian Nowozin, a bunch of others, and I put this paper up online exploring why that's the case: why does the Bayesian approach seem to not always outperform the non-Bayesian approach? And there are, I think, some really interesting technical challenges we need to get past to answer that. What are good priors? Dustin talked about that. What's the role of the choice of architecture, hyperparameters, heuristics like batch norm? Are they Bayesian? Are they not Bayesian?
D
Do they give better uncertainty or not? How do we efficiently marginalize over high-dimensional neural net posteriors? So better approximations are certainly a strong research area right now, along with getting a better understanding of OOD behavior and formulating a more rigorous Bayesian interpretation of deep ensembles.
D
That
is,
I
think,
on
archive
now
that
that
really
tries
to
pin
down
this
question
and
then
we
need
better
benchmarks.
So
we
need
realistic
benchmarks
that
reflect
real
world
challenges,
which
maybe
some
of
you
have
immediately
have
ideas
where
you
know.
If
you
get
better
uncertainty
on
on
your
problem,
which
is
like
a
real
scientific
problem,
then
it's
it
would
be
really
meaningful
and
it
would
be
really
useful.
I
think,
for
the
for
the
community
to
develop
across
those
those
benchmarks,
all
right
next
slide.
A
We have a lot of references in the intermediate slides where we present this stuff as well. We'll add them back here too, so that you can find them all in one place.
D
Yes, they effectively do. If you're familiar with the kernel trick, then you probably wouldn't be asking this question, so maybe that's not the right way to answer it, but effectively they have an infinite number of parameters, and the way that it works is that you can actually compute the integral.
D
So
what
you
want
is
the
covariance
between
examples
over
the
the
last
layer
of
and
that's
for
example,
and
so
what
you
do
is
you
say
I
have
theta
of
x
times,
theta
of
x,
prime,
the
inner
product
of
those
two,
and
that
will
give
me
the
covariance
between
these
two
examples
at
the
end
of
the
neural
net.
D
Then,
if
I
compute
an
integral
over
that
from
negative
infinity
to
infinity
effectively
marginalizing
over
all
possible
weights,
then
I
can
actually
compute
that
integral
the
integral
over
that
inner
product,
analytically,
which
is
exactly
the
construction
that
happens
for
most
of
the
the
kernels
in
svm.
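Written out, the quantity being described is roughly the following, where $\phi_\theta(x)$ denotes the last-layer activations and $p(\theta)$ is the Gaussian prior over the weights:

$$
k(x, x') = \int \phi_\theta(x)^\top \phi_\theta(x')\, p(\theta)\, d\theta
= \mathbb{E}_{\theta \sim p(\theta)}\!\left[\phi_\theta(x)^\top \phi_\theta(x')\right].
$$

For common activations and a Gaussian prior, this expectation has a closed form, which is the same construction behind many classical SVM kernels.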
D
It's clearly a pretty nuanced thing, and super elegant once it kind of clicks, but it might take more than a few minutes to explain in all its glory.
B
So I think the best way to choose could be, if you just look at... actually, I think this is done pretty well in terms of the leaderboards that they have in the open-source code. Ultimately, what matters isn't really the algorithmic approach but things like compute: how much compute you have, what sort of assumptions you're making with the model, and what you can better impose in your model.
B
So,
given
those
things
you
can
sort
of
just
choose
the
the
top
one,
but
of
course,
if
you're
doing,
if
you're
doing
research
like
methodological
research,
you
can
always
just
sort
of
choose
your
favorite
one
see
if
you
can
advance
it
a
little
bit
more
best
solutions
for
you.
Jasper.
Do
gps
work
well
in
modeling
temporal
data
that
is
irregularly,
sampled,.
D
Good question. I would say yes, but with a caveat. So, Gaussian processes... I think a previous question kind of alluded to this, right? Like, how do you specify a prior over the function that you care about? And Gaussian processes give you a really nice toolkit to do that effectively.
D
You can basically say: here is kind of the model of the space of functions that I might imagine seeing, and in GPs we do that by specifying a kernel function. That might be: maybe I think it's twice differentiable and really smooth, or maybe I think it's periodic, and there's a covariance function for that.
D
You might also specify, okay, I think it's a step function, and you can compute a kernel which is like the inner product of infinitely many step functions, so you can really carefully specify what your model is using a GP. And so it may be: I think it's a dynamical system that is irregularly, not regularly, sampled, and you can carefully specify that model.
D
So I would say, if I was modeling that data, I would probably use a GP and not a neural net, unless it's a large data set, but I would very carefully specify the model and think about it before trying to apply a standard GP with a standard kernel.
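For example, a minimal scikit-learn sketch of hand-specifying such a prior for irregularly sampled times; the kernel combination here (smooth trend plus periodic component plus noise) is purely illustrative and should encode what you actually believe about the dynamics.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 10, size=40))[:, None]     # irregularly spaced observation times
y = np.sin(2 * np.pi * t.ravel() / 3.0) + 0.1 * t.ravel() + 0.1 * rng.normal(size=40)

# Prior belief: a smooth long-term trend + a periodic component + observation noise.
kernel = (RBF(length_scale=5.0)
          + ExpSineSquared(length_scale=1.0, periodicity=3.0)
          + WhiteKernel(noise_level=0.01))
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(t, y)

t_star = np.linspace(0, 12, 200)[:, None]
mean, std = gp.predict(t_star, return_std=True)        # predictive mean and uncertainty
```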
A
So that's a great question. There has been some work on using Bayesian inference for generative models as well. I think the methodology can definitely apply: integrating out parameters is a quite general principle, and you can do that.
A
There's a parallel literature which... there are actually a lot of things that we didn't go into in this talk, and one of them is the role of generative models in out-of-distribution detection, detecting OOD inputs and so on. So the short answer is that in initial experiments we found some surprising results in making these ideas work, but we've now been making some progress, and I'll add some references.
A
I shared a slide in the Zoom Q&A summarizing some of our work from earlier this year, but we also have some recent work that has been trying to get to the bottom of this phenomenon, so maybe that could be useful for folks. And if we didn't answer your question, please ask it on the Slack channel or feel free to email us.
C
Oh, there is, yeah, maybe one last question that I think is actually interesting: in the infinite limit of training the model, would the UQ be reduced?
D
That's a great, and I think very loaded, question. So, you know, I guess in the theory on stochastic gradient descent, if you have an infinitely small, an infinitesimal, learning rate and you run forever, you will converge to an optimum, and I think the answer is yes: at that point you will probably have worse uncertainty.
D
There's, you know, a bunch of work studying things like early stopping. So if you hold out a validation set while you're training, you watch the training curve get better, but you also watch the validation error eventually start to get worse.
D
Beyond that, I guess certainly all of the MCMC literature would suggest that you should maybe keep training forever, but add noise to your model, so it's following a Markov chain through the loss manifold, or the posterior, effectively.
C
Okay, I think we had so many questions, and yeah, this was a great lecture. Thank you again, Balaji, Dustin, and Jasper. There's certainly also a lot of material to check after the lecture, and maybe we can look at the code and also look at some of these seminal papers. So thank you again for putting in all this effort preparing this great material and great lecture.
C
Yeah, thank you, and thanks to everyone for joining today's lecture. If you have more questions, please feel free to ask them on the Slack channel relevant to this lecture; I think Balaji is already there, so if you have questions, at least Balaji will be able to answer, and maybe Jasper and Dustin can chime in at some point. Hopefully the slides are up on the website, and I'll update them once Balaji has the links for more material.