From YouTube: 12 - Generative Models - Emily Denton
Deep Learning for Science School 2019 - Lawrence Berkeley National Lab
Agenda and talk slides are available at: https://dl4sci-school.lbl.gov/agenda
Yeah, so generative modeling is a really massive field. I'm going to try to fit a lot of different content into this hour and a half. My strategy is to give you some good intuitions, give you a big-picture look at the different frameworks people have come up with for doing generative modeling, and then also provide a lot of links to papers and other tutorials and references. So, just to start out.
The basic thing that maybe you've learned so far is supervised learning, where you have a data set that contains some inputs X and some labels Y, and you want to learn a function mapping X to Y. Another way of framing this is learning a probability distribution over the labels conditioned on some input X. So this is the traditional setup. Unsupervised learning, in contrast, really just looks at the data, so we typically don't have any labels.
The goal is just to uncover some kind of hidden structure in the data. This goal is super vague, and that's on purpose: the end goal of unsupervised learning is different depending on what you're trying to do, but the basic idea is to try to understand some kind of structure in your data. Generative models fit under this overall paradigm of unsupervised learning.
So today we're going to be focusing mostly on parametric generative models. There are a lot of different types of nonparametric generative models; for example, in the image domain, you have models that might copy patches from training images and synthesize these in different ways. We're not going to be looking at those at all.
Instead, we're going to be looking at classes of models that are parameterized by some function, where we want to estimate the parameters from some data set. The basic setup is: we have some data set, in this case a data set of faces, and we have some kind of prior knowledge that we're going to inject. This could be as little as saying what our function class is.
We could also add additional structure to our network, whatever we want, and then we're going to learn in some way, which I'll go through in a bit. Basically, the goal of learning can be framed in a couple of different ways. One thing we might want is for samples from our data set to have high likelihood under the model that we've learned; this is kind of a density estimation framework. We also might want samples synthesized from our model to reflect the structure of the data distribution. Obviously these two things are intimately connected, but sometimes we'll actually end up optimizing more for one or the other.
So just to motivate this a little bit: why do we care about generative models? There are a lot of different reasons. First, generative models are really good at helping to uncover hidden structure in data sets. This is an example from a paper that came out a couple of years ago called InfoGAN, where they learned a generative model.
In this case, they trained the generative model on these little chair images, but in a purely unsupervised way the model was able to uncover really high-level factors like the chair pose, the chair structure, and the chair width. This image here just shows synthesized images that vary along these different factors of variation. There are also a lot of really cool image editing applications that have emerged from generative modeling in recent years. This is an example of super-resolution.
On the right here we have — can you see my cursor? There we go — here we have the original image, and then this is the super-res version produced by the network. Generative models can also be used to synthesize predictions of the future given some data about the past. Here at the top I'm showing some predicted future frames of a video of a robot arm that's pushing around different objects on a table. Models that can perform this kind of future prediction are really useful when building agents that need to reason about the effects of their actions in the world.
People have also used these types of models to help with exploration: you can imagine asking, do I have a good idea of what's going to happen when I perform this action? If not, maybe I should explore this part of the space a little better. Density modeling is also another way of performing outlier detection. Here you can imagine that if you have a generative model of this data set of street images and you get some new image, you can look at the likelihood of that image under the model you've learned and try to understand whether this is a really likely event or a really unlikely event, and then use that in some other downstream application.
Generative models are also really useful tools for artists. This is some really cool stuff that came out of a Google team called Magenta: it's basically a music synthesis model, and they've turned it into a whole bunch of different tools that artists can use. It's really fun and cool, so play with that if you want. Okay, so now I'm just going to go into some background material that will come up throughout the course of this talk.
So, KL divergence is a measure of how far apart two distributions are. It's not a proper distance metric — it's not symmetric — but it does measure the difference between two distributions, and there is also the reverse KL. You can see at the top here we have KL of P and Q, and on the bottom we have KL of Q and P. These are two different ways of measuring distances between distributions, and I bring it up because the different distances emphasize different things.
With the reverse KL we might end up with higher-quality samples if we sample from the model, because we wouldn't end up sampling from regions where there is no data, but we may drop a mode. I'm just throwing this out here because it will come up later, and it's a good intuition to have. Then the Jensen-Shannon divergence is a third divergence that will come up; this one is a symmetric distance metric. It also tends to emphasize not putting any density where there is no data, at the expense of dropping modes.
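Just to make the forward/reverse distinction concrete, here's a tiny sketch (not from the talk) with two discrete distributions — a bimodal p and a single-mode q — where the forward KL punishes q for dropping one of p's modes much more than the reverse KL does:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions; only terms where p > 0 contribute."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# A two-mode p, and a q that only covers one of the modes.
p = np.array([0.49, 0.49, 0.02])
q = np.array([0.02, 0.95, 0.03])

print(kl(p, q))  # forward KL: large, penalizes q for missing one of p's modes
print(kl(q, p))  # reverse KL: smaller, q is content to sit on a single mode of p
m = 0.5 * (p + q)
print(0.5 * kl(p, m) + 0.5 * kl(q, m))  # Jensen-Shannon: symmetric in p and q
```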
Okay, so another thing that's going to come up is this difference between explicit and implicit models. Likelihood-based methods, also known as prescribed probabilistic models, typically provide an explicit parameterization of a log-likelihood function, and parameter estimation here proceeds in a very standard way that I'm sure you've learned earlier in the summer school.
We just do maximum likelihood estimation: we estimate the parameters that maximize the likelihood of the data we have access to through our training set, under the model that we've defined. In contrast, implicit probabilistic models don't need to define an explicit likelihood function. Instead, they just define a sampling procedure, and the intuition is that we're going to learn this generative distribution by comparing samples from our generated distribution with the training distribution we have access to.
Another concept I want to introduce is the idea of latent variables. At an intuitive level, latent variables can be thought of as explaining the structure in a given data instance by some latent variable Z. Throughout all of this I'm going to use X to refer to the data; this is referred to as observed because we have access to it at training time, and Z is going to refer to these latent variables.
These are the unobserved factors that are causally related to the things in the world. The idea is that they describe the underlying factors of variation in your data set. So if you have a data set of faces, as described here, the factors of variation might be things like the lighting conditions, the pose of the face, the hairstyle, the identity of the person, things like that. The idea is that these latent variables are going to concisely represent those different factors of variation in your data set.
We have some prior distribution over the latent variable Z. Typically this prior distribution is going to be defined as something tractable; a really common choice is just a diagonal Gaussian. There are lots of more sophisticated choices you could make, but in most of the examples I work through today we're just going to assume a simple Gaussian. And then there's this P_theta of x given Z.
This is typically referred to as an observation model, and again it's typically taken to be something tractable, easy to compute and easy to sample from. In all of the cases we're going to look at here, it's going to be parameterized by a deep neural network. This sampling procedure, where you first sample a latent code from the prior distribution and then sample a data instance from the observation model conditioned on that latent variable, is called ancestral sampling.
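As a minimal sketch of ancestral sampling under these assumptions (standard Gaussian prior, a small hypothetical decoder network parameterizing the mean of a Gaussian observation model with a fixed variance):

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 784

# Hypothetical observation model: a decoder mapping z to the mean of p(x | z).
decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))

def ancestral_sample(n, sigma_x=0.1):
    z = torch.randn(n, latent_dim)               # z ~ p(z) = N(0, I), the simple prior
    mu_x = decoder(z)                            # parameters of the observation model p(x | z)
    x = mu_x + sigma_x * torch.randn_like(mu_x)  # x ~ N(mu_x, sigma_x^2 I)
    return x

samples = ancestral_sample(8)
```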
Don't worry if this is all going really fast; we're going to come back to it throughout, I just want to plant the seeds. Another thing we might think about doing with generative models is actually trying to infer, given some data instance, what the latent variables are that caused that data instance. This distribution here, P of Z given X, is typically referred to as the posterior distribution over the latent variables.
It's often really hard to do this exactly, as we'll see throughout our models, but it can be useful — for example, if you learn a generative model and then you want to actually use the latent space you learned for some downstream task, maybe a discriminative task, maybe some kind of clustering task, and so forth. Cool. So that's all the background stuff we need to know. Now, the basic idea: we have a data set.
We have a parametric model P_theta, and there are a couple of questions we might ask: how are we actually going to represent this probability — what function class are we going to use — and how are we going to actually learn the parameters? And again, as I mentioned earlier, there are a lot of different goals we might have moving forward. We may want samples from the data set to have high likelihood under the model that we've learned.
We might want samples from our model to reflect the structure of the data distribution — and this notion of reflecting the structure and being good-quality samples is often hard to quantify. We also might care about representation learning: often generative models are learned with the goal of actually learning a nice, clean latent space that can be useful for something else, so that's something we might want to think about when we're training the model. Okay.
So this is my attempt to summarize the different types of models. I've broken it up into explicit density models and implicit density models: the ones with an explicit density we're going to learn through maximum likelihood estimation, and the implicit density models we're going to learn by basically comparing samples from the data set with samples from our generated distribution. Within the explicit density models there are two classes: ones that define a tractable likelihood.
This means that the log-likelihood of the data under our model is something we can optimize exactly, and there are a couple of different examples here. Non-tractable models are ones where we have a likelihood function but we can't optimize it exactly, so we're going to rely on some kind of approximation. Of these models, the ones I'm going to cover the most are autoregressive models, flow-based models, variational autoencoders, generative adversarial networks, and moment matching networks.
I chose these because they represent, for the most part, the state of the art in generative modeling, and you'll see them a lot. Cool, so I'm going to start with variational autoencoders. High-level summary of variational autoencoders: these are directed latent variable models; they rely on likelihood-based learning; the exact likelihood is intractable, but we're going to derive a lower bound on the likelihood; and there's an efficient ancestral sampling procedure.
Again, this means we sample Z from our prior and then X given Z, and there's an approximate inference scheme. Pictorially, what this looks like: this is the generative process, the standard ancestral sampling, and the likelihood is intractable, so we're going to optimize a bound instead.
What this means is we're going to have some network mu, which is going to take in a latent code and produce a mean. This mean is going to have the same dimensionality as our data instances, and it specifies the parameters of our observation model, which in this case is a conditional Gaussian. You could also imagine learning the covariance matrix, or just its diagonal.
In this context, when people are modeling images they typically just take the covariance as fixed, but you could also learn it. A key thing that variational autoencoders do is introduce what's known as an approximate posterior. This Q_phi is going to be an estimate of the true posterior over the latent variables given some image, or given some input X, and again this is going to be represented as a conditional Gaussian.
So what this looks like is: we have some data instance, we pass it through this network, and that's going to produce this mu and sigma, which are the mean and diagonal covariance of our conditional Gaussian, and that specifies the approximate posterior.
In the next couple of slides I'm going to derive a bunch of math and then go back and give some good intuition, because when I first learned variational autoencoders it felt like too much math and I didn't know what it was doing. But when you actually understand how it all fits together, it's a very, very simple framework that is applicable super widely, so bear with me as I go through this.
Okay, so we have our log-likelihood. We have these latent variables Z, so we can express the likelihood as an integral over the product of the prior and the conditional distribution of the data given the latent variable. Now, all I've done here is multiply in something that equals one — again, this is our approximate posterior. Next I'm just replacing the integral over the latent variable Z with an expectation, which is just the definition of what an expectation is. Jensen's inequality lets us pull the log inside the expectation, so now we have an inequality, and then I'm really just rearranging terms. This term on the right here is just the KL divergence between the approximate posterior and the prior distribution that we have, and this bound is what we refer to as the evidence lower bound, or the variational lower bound. Okay.
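For reference, a cleaned-up version of the derivation just described, using the notation from the talk (prior p(z), observation model p_theta(x | z), approximate posterior q_phi(z | x)):

```latex
\begin{aligned}
\log p_\theta(x)
 &= \log \int p(z)\, p_\theta(x \mid z)\, dz
  = \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p(z)\, p_\theta(x \mid z)}{q_\phi(z \mid x)}\right] \\
 &\ge \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
   - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
  \;=\; \mathcal{L}(\theta, \phi; x) \quad \text{(the ELBO)}
\end{aligned}
```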
This is just an expectation of a constant, so it turns into log p of x, and then this here is actually the KL divergence between Q and P. Just rewriting that, we now have the data likelihood and the KL divergence between our approximation to the posterior and the true posterior, and if we move the likelihood term to the other side, we see this bound again. So this is the thing we're going to be optimizing.
This is the thing we want to optimize, and the two are equal when our approximate posterior equals the true posterior. Cool. So now I'm going to flip it around and look at how this is implemented with neural networks, which I personally find a little more intuitive. Variational autoencoders are named that way because of their resemblance to traditional autoencoders.
Just to remind you — I hope autoencoders were covered already — the basic idea with an autoencoder is that you have some data instance, you encode it to some typically lower-dimensional space, and then you reconstruct it, and you typically have some kind of reconstruction error, for example mean squared error in your input space. Variational autoencoders look very, very similar to this, and we can also think of them as a stochastic and regularized version of an autoencoder.
So here we have the observation model. Again, we have this prior distribution, which we sample from; we're going to have some network, which will be our decoder network, and it's going to produce the mean of a Gaussian distribution, and then we can sample from that distribution. Then there's our recognition model, which we're also going to refer to as the encoder model: here we have some data sample, we have an encoder which produces the mean and diagonal variance of a conditional Gaussian distribution, and then we can sample from that. So, just to write this out in a way that maps it back to autoencoders:
We have our input, we get a mean and variance, and this defines our approximate posterior distribution. We can take a latent code — this could either be sampled from our approximate posterior or sampled from the prior — we decode it, and then we get our distribution over data instances. So again, recall this likelihood term: we had this reconstruction term and this prior term, and I'm just going to go through what each of those looks like.
[Audience question] Yeah, so in this example here it's fixed. You could learn it the same way that you learn this one — and I'll get to how you actually learn this one — but in this case I'm just going to treat it as fixed. I'm doing this because I mostly work with images, and when people work with image datasets they just kind of don't bother with this. Yeah, exactly.
Also, I think the intuition here is nice and easy if this is fixed, because then the likelihood of X under this model is a Gaussian, so everything else falls away and you just have this mean squared error. It's a nice intuitive mapping onto autoencoders, but you could totally learn it, and I'll describe how you would learn it — that carries over. So yeah, here I'm saying this reduces to mean squared error.
So, optimizing this. Okay. This term here is really easy to compute when everything is Gaussian, but actually optimizing this expectation is a little tricky, so a cool trick was proposed a couple of years ago, referred to as the reparameterization trick. The idea is to rewrite the random variable Z as a deterministic function of another random variable, which in this case I'll call epsilon.
This is a generic example of what it would look like. In the Gaussian case we can rewrite our random variable as the mean — the mean output by our network — plus this diagonal covariance term multiplied by a zero-mean, unit-variance Gaussian variable. This means that the expectation we want to optimize can actually be rewritten in terms of this epsilon variable. So now this reconstruction term is really easy to optimize, and for the expectation here we just take Monte Carlo samples. In practice people often just use one sample, but there are lots of extensions that look at optimizing this better than that; for now let's just assume one sample.
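A minimal sketch of the reparameterization just described, assuming the encoder outputs a mean and a log-variance (a common convention, not something specified in the talk):

```python
import torch

def reparameterized_sample(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I), so gradients flow through mu and sigma."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps
```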
Okay, so now going back to this autoencoder framework, the two different pieces of the loss term look like this. Here we have our KL loss: we have our encoder, we output this mean and variance, and this KL term, if our prior is Gaussian, can just be computed analytically — we can get the gradients directly.
This is also easy for a larger family than just the Gaussian distribution, so that part is very easy and simple. The second part of our loss function is the reconstruction term, and again this essentially boils down to an L2 loss here. So a nice intuition is that you have this basic autoencoding framework, plus an additional regularization term.
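Putting the two pieces together, here's a rough sketch of the per-batch VAE loss under the assumptions used in the talk (standard Gaussian prior, diagonal-Gaussian approximate posterior, fixed observation variance so the reconstruction term reduces to a mean squared error); the encoder and decoder modules are placeholders:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, encoder, decoder):
    mu, log_var = encoder(x)                                    # parameters of q_phi(z | x)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)    # one reparameterized sample
    x_recon = decoder(z)                                        # mean of p_theta(x | z)

    recon = F.mse_loss(x_recon, x, reduction="sum")             # reconstruction term (fixed variance)
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())  # KL(q || N(0, I)), analytic
    return recon + kl
```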
There are a lot of different extensions of variational autoencoder models; I'm just going to stick a couple of references here. We have some sequential models: these all do basically the same thing, and they differ mostly in the way they represent the prior. In all the examples I was working through so far we had a fixed Gaussian prior. If you have time-series data, you can imagine, instead of having a fixed Gaussian prior, actually learning the prior at each time step and having it depend on either previous data instances or previous states of your RNN, things like that. There are a lot of different applications here: modeling speech and handwriting and natural language, music generation — this is another cool thing that came out of Magenta.
If you want to do nice, controlled generation of images, or of whatever data instances you're working with, that's a cool framework for it. The VQ-VAE is a vector-quantized VAE; it extends the basic VAE framework to discrete latent codes. This is a really nice generative model. It gets pretty good image synthesis results — I'll say more about what an autoregressive decoder is later, but they use a very powerful decoder while also learning a latent space — and these are just some examples of images synthesized by this model.
A lot of this talk is going to be me showing you images, because most generative models are applied in the image domain. This model has also been used to generate video frames, which is a nice example of using a generative model for a cool downstream task: they learned this model, and then once they had this nice latent space, they trained a sequential model in that latent space, as opposed to training the generative model in pixel space. Then they can synthesize these latent codes and use the decoder that maps down to images.
That's a cool thing. Okay, so I'm going to go into a different type of generative model now: autoregressive models. High-level summary: autoregressive models are fully observed models, in contrast to latent variable models, and I'll get a little more into what that means in a second. It's a likelihood-based learning method, but in contrast to VAEs, where we had a likelihood function we couldn't optimize directly and so derived a lower bound, autoregressive models define a tractable density.
They do this basically by specifying an ordering on the variables and then modeling a product of conditional distributions. Sampling can be slow because it's an iterative process; there's no latent representation, which comes from the fact that it's a fully observed model; and they can sometimes be slow to train, although there are lots of efficient implementations. Cool. So, as I said, it's a likelihood-based method, and we'll see that we can specify the model so that the likelihood can be optimized exactly. The basic idea:
We have our data instance X, which we can break up into its different dimensions, x1 to xn, and we're going to define an ordering on the components of X. Basically, just using the rules of probability, we can rewrite P of X as a product of conditionals: P of x1, times P of x2 given x1, times P of x3 given x1 and x2, and so on and so forth. What this looks like in graphical-model form is that each of these variables depends on the previous ones, and there's a sequential sampling procedure where we first sample x1 from some distribution over x1 and then subsequently sample each component of our data given the previous components.
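Here's a tiny sketch of that sequential sampling procedure; the `model` is a stand-in for whatever network returns the parameters of the i-th conditional (for concreteness I assume binary components, which is an illustration rather than anything from the talk):

```python
import torch

def autoregressive_sample(model, n_dims):
    """Sample x one component at a time; model(x[:i]) is assumed to return the
    parameter of p(x_i | x_1, ..., x_{i-1}) given the components generated so far."""
    x = torch.zeros(n_dims)
    for i in range(n_dims):
        probs = model(x[:i])            # conditional distribution for component i
        x[i] = torch.bernoulli(probs)   # e.g. binary pixels; any conditional works here
    return x
```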
Cool. So I'm just going to quickly go through a history of different models here, just so you have lots of references if you want to go look things up on your own. This is an early autoregressive model; it showed pretty promising results at low resolutions, but it has a feed-forward architecture — as we'll see, we'll get into convolutional and recurrent architectures. This one is a very similar architecture, so it had limited expressive capability.
Okay, so now we're getting into slightly more modern and powerful autoregressive models. PixelRNN is a deep generative model of images. The pixels are ordered in a raster-scan manner — so if this is an image, the ordering goes left to right, top to bottom — and each pixel is generated conditioned on the previous pixels. These are some examples of image completions, and for 2016 this was pretty powerful.
Video pixel networks are basically an extension of PixelCNNs to a recurrent video predictor, and these produced pretty good results — again, this is from 2016. WaveNet is a really cool extension: it's very similar to PixelCNN, but applied to 1-D audio signals. I think it's applied mostly to speech; you could also apply it to music and things like that. It's a fully convolutional neural network, and the convolutional layers have a dilation factor, which allows the receptive field to grow basically exponentially and makes it much more efficient.
Actually, here's a picture of the dilation — there we go. Basically, you can think of a dilated convolution as a convolution with a very, very wide receptive field but lots of holes within it, so in this case it captures long-term temporal dependencies in a very efficient manner. This is another cool model; I'd look it up, it's fun, and it can be applied to a lot of different 1-D signals.
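To make the exponentially growing receptive field concrete, here's a small sketch of my own (not code from the talk) of a stack of dilated, causal 1-D convolutions in the WaveNet style; with dilations 1, 2, 4, 8 the receptive field roughly doubles with each layer:

```python
import torch
import torch.nn as nn

class DilatedCausalStack(nn.Module):
    def __init__(self, channels=32, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.dilations = dilations
        self.layers = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=d) for d in dilations
        )

    def forward(self, x):  # x: (batch, channels, time)
        for d, conv in zip(self.dilations, self.layers):
            # Left-padding by the dilation keeps the convolution causal (no future leakage).
            x = torch.relu(conv(nn.functional.pad(x, (d, 0))))
        return x
```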
Okay, so normalizing flows. This is the next class of generative models we're going to look at. It also fits within the likelihood-based methods, where we have a tractable likelihood that we're going to optimize. High-level summary: it's a directed latent variable model, so similar to the VAE; it's likelihood-based learning, and we'll see that we can define the likelihood in a way that allows for exact optimization of the log-likelihood.
The one downside is that these can be slow to train. Cool, so the basic tool we're going to use here is called a normalizing flow. Normalizing flows are basically a tool for constructing complex distributions by transforming a probability density through a series of invertible mappings. I found this nice little graphic here which I like, and basically it shows that we apply a sequence of invertible transformations, f1 through fK.
These are all going to be a sequence of invertible transformations, so if we think in our generative-model space, we can have some prior distribution over our latent codes — again, something nice and simple — and then apply a sequence of transformations in order to get a distribution over our data instances.
There are two key concepts we need to understand for normalizing flows: the determinant of the Jacobian, and the change of variables theorem. Just as a really quick linear algebra refresher, the Jacobian matrix is the matrix of first-order partial derivatives, and we need its determinant.
I'm just stating it here, and then we'll use it. The change of variables theorem basically tells us how to infer the unknown probability density function of a new variable — in this case p of x — given that we know P of Z. So we know P of Z, and G theta is some deterministic function of Z; we'll see later that we can actually write G theta as a series of transformations, and the change of variables theorem lets us write this likelihood term in terms of the density over Z — sorry, the slide is wrong here,
it should say Z, because otherwise it doesn't make sense. Cool, so just working through this in our generative model: for the generative process, we have this prior — again, the exact same thing as the variational autoencoder. It differs from the variational autoencoder in that with the VAE we actually had an observation model here; now G theta is just going to be a deterministic function of Z.
If you want, you could imagine our observation model as being a Dirac delta function on one point. Again, it's a likelihood-based method, so we're optimizing the log-likelihood with respect to theta. This is just repeating what I had, except now we have the correct Z here. Okay, so I'm just going to walk through the different components of this slide, because it explains everything you need to know about flow-based generative models.
So we begin with an initial distribution P of Z, and we're going to apply a series of invertible transformations. We have this function f, and we can basically write it as a series of functions, so the relationship between X and h1, between h1 and h2, and so on down to Z, is given by each of these individual functions. This is exactly what I wrote before, except now I'm separating out each of the different layers.
It's useful to think of it in terms of these compositions of transformations, because we're going to build this with a big neural network, and it means we just need a certain property to hold for each of the individual components of our network. This is exactly what we were doing before with the change of variables; here I'm just expanding it out to this sequence. What we need in order to do this is for each of these functions f_i to be easily invertible, and for the determinant of its Jacobian to be easy to compute.
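Written out, the change of variables formula for a composed flow x = f_K ∘ … ∘ f_1(z) gives the exact log-likelihood these models optimize (with h_0 = z and h_K = x in the notation of the slide):

```latex
\log p_\theta(x) \;=\; \log p(z) \;-\; \sum_{i=1}^{K} \log \left| \det \frac{\partial f_i(h_{i-1})}{\partial h_{i-1}} \right|
```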
A lot of different flow-based generative models have been proposed; I'm just going to point to a couple of them. This is an early one, NICE (nonlinear independent component estimation), which basically stacked a sequence of invertible transformations called additive coupling layers.
The RealNVP model built upon this; the main distinction is that they changed the additive coupling layers to affine coupling layers by adding a scale parameter, and they also introduced a multi-scale architecture which allowed for more efficient models of large images. These are some examples of images generated from this model, which at the time were quite impressive. Then, more recently, there's a cool model called Glow.
Glow basically builds on the RealNVP model, but it introduces invertible one-by-one convolutions and ends up with a pretty efficient architecture; it also pulls in the multi-scale ideas from RealNVP, and now we end up with some really nice images. With this model we're getting into the current generation of generative models, where we're having really, really good image synthesis results.
So this model — I don't know, these faces maybe aren't quite where you'd want them to be. I've played around with this model a little bit, and one of the downsides is that in order to get really nice quality results, you end up having to sacrifice diversity a little bit. Some extensions of this stuff: this is a flow-based generative model for video.
All of these video models use these robot arms that push around objects on a table, so any video examples they show are basically going to be this. Okay, so generative adversarial networks. Now we're going to get into a different class of generative models that don't rely on maximum likelihood estimation. I'm going to expand this out a little bit — there's a nice paper here which I'm not going to cover too much of.
Instead, you say: I don't even care about the likelihood function. I don't need it — I could have it, but I'm not going to use it. In all of these cases — generative adversarial networks and moment matching networks — we just aren't going to have it. Instead, what we have access to is this generative process, and the idea is that training is going to proceed by comparing sets of images, essentially, between the data distribution that you have access to through your training set
and images sampled from your generative model. There are different sorts of approaches here: moment matching networks, generative adversarial networks, and f-divergences, which I'm going to talk about very briefly — that's a broader class of learning metrics, and generative adversarial networks can fall under it under certain circumstances. Basically, all of these approaches learn through this comparison. I think that's just a repeat of the same slide — oh wait, now I'm going to go to generative adversarial networks. Cool: so, generative adversarial networks.
I'm going to spend a decent amount of time on these because they're used everywhere now. This is a slide I took from another talk, which I really like: it's the number of GAN papers per month, starting in 2014 up until, I don't know, maybe 2018 — just a huge exponential spike. So again, this is also a huge area. A lot of these papers are different tricks and techniques to improve the stability of training; some of them define new loss functions that are slight variants of the original one.
Some of them are applications. There's just so much here, so I've tried to be a little bit selective and give a good overarching view of what GANs do, a tiny bit of theory, and then some extensions that have been proposed in the last couple of years. Just to show — I've shown a lot of images so far — this shows the progression of generative adversarial networks from 2014 up until 2018. These are all synthetic faces.
None of these are real people. In 2014, when GANs were originally developed, this was state-of-the-art generative modeling of faces — generating realistic images is really, really hard — and then just a couple of years later, this is from StyleGAN, which I'll talk about a little bit, and now we're able to get super high-resolution detail and high-quality image synthesis. Okay, quick summary: GANs, again, are directed latent variable models.
This is just what the generative process looks like. It's similar to the Glow model, the flow-based model, where we have some simple prior and then our X is a deterministic function of that latent variable — in contrast to the VAE, where we had a probabilistic observation model; here it's just a deterministic function, similar to the flow-based model. So if you want to think of this as a probability, you can think of it
as all of the density sitting on one single point. Cool. So right, there's no explicit density. The intuition with GANs is that we're going to learn via a two-player game. We have our generator, which is defined up here, and we have a discriminator, and the discriminator is trained to distinguish samples that come from the true data distribution, which we have access to through the training set.
The actual loss for the discriminator will change a bit depending on the framework, but essentially it's just trying, as best as possible, to distinguish between these two sets of images. The generator is trained to produce samples that fool the discriminator, so training proceeds in a back-and-forth fashion where the generator and discriminator are constantly updating and learning. As the discriminator gets better at differentiating generated samples from true data samples, the generator has to get better at fooling the discriminator, and basically get better at producing images — or data instances, if we're not in the image domain — that look like real data.
What this looks like in network form: we have this discriminator, and there are two types of data the discriminator sees. If it sees real data, it's basically trying to produce something close to one — again, the exact loss function, which we'll get into, will change depending on the framework — so the discriminator sees two different types of samples. Then for the generator: the Z sampled from the prior, combined with the generator, really just defines a model distribution, and the generator is trained to make the discriminator think that its sample is real, so close to one. We'll see that the way the generator is trained is basically forward propagation through this network; the generator has some loss function, and then the gradients are propagated through the discriminator and back to the generator.
So when the generator is being learned — when the gradients are passing through here to the generator — the discriminator is held fixed, and then when the discriminator is learning, the generator is held fixed.
This is what the originally proposed loss function looks like. The first term here is an expectation under our data distribution — so again, this would be approximated with samples from our training set — and it's the log-probability that data from the true data distribution are considered genuine by the discriminator. In contrast, here we're sampling latent variables from our prior distribution and feeding them through our generator, so this G of Z gives a generated sample, or a fake sample, and then this is the log-probability under D that the samples from the generator are considered fake. If you ignore the G part, this is just D optimizing a binary cross-entropy loss function. And then for G: we have max over D, min over G, which means the generator is basically trying to fool the discriminator, and the generator's gradient gets back-propagated through D. So again, as I mentioned, this is an alternating optimization procedure.
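The original minimax objective being described is, in the usual notation:

```latex
\min_{G}\,\max_{D}\;\; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
 \;+\; \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]
```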
Again, a lot of the theory around GANs holds under optimal conditions, and we are never under optimal conditions, so a lot of this is giving a bit of intuition rather than saying anything super precise. But another important intuition: earlier on, at the beginning, I talked about the difference between minimizing forward KL, reverse KL, and the Jensen-Shannon divergence, and what that looks like for different generative models. Here we're minimizing the Jensen-Shannon divergence, and here's a good slide.
GANs tend to have higher-quality samples, but they might miss modes. In contrast, look at the KL divergence: maximum likelihood estimation — I should have said this at the beginning — minimizes the forward KL divergence between the data distribution and the model distribution. A lot of early maximum-likelihood-based models, typically VAEs for the most part,
tended to produce samples that were quite blurry — not as crisp and as sharp as GANs — and one of the ways this was explained was that the KL divergence tends to emphasize capturing all of the different modes, basically putting model density anywhere that data lies, and sometimes this comes at the expense of putting density
where there is no data. You can see — I can't point with the cursor — in this middle area here, basically, there is no data, but the model ends up putting density there. So that's a little bit of intuition for the different trade-offs of these models. This image comes from the paper that I'm citing here, and I would also just recommend reading it.
This is more commonly used; it's often referred to as the non-saturating GAN loss. Basically, the discriminator's objective doesn't change at all, but the generator's objective changes. In the original objective the generator was just trained to minimize this function, and G only really shows up in this part, so G was trained to minimize this term here. In contrast, with this new function G is trained to maximize this term here.
What this looks like in practice: the discriminator has real images labeled with label 1 and fake (synthetic) images labeled with label 0, and it's trained with the binary cross-entropy loss to distinguish between those two sets of examples. Then for the generator, what we do if we're implementing this is just flip the label of generated images from 0 to 1 when we're optimizing the generator, feed that through the discriminator with the flipped label, get the gradients through the discriminator, and update the generator based on that.
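Here's a rough sketch of one training step with the non-saturating loss as just described — real labels 1, fake labels 0 for the discriminator, and flipped labels for the generator update. The generator, discriminator, and optimizers are placeholders, and the discriminator is assumed to end in a sigmoid so it outputs probabilities:

```python
import torch
import torch.nn.functional as F

def gan_step(real, generator, discriminator, opt_d, opt_g, latent_dim=64):
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator step: real images get label 1, generated images get label 0.
    fake = generator(torch.randn(batch, latent_dim)).detach()
    loss_d = F.binary_cross_entropy(discriminator(real), ones) + \
             F.binary_cross_entropy(discriminator(fake), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step (non-saturating): flip the label of generated images to 1.
    fake = generator(torch.randn(batch, latent_dim))
    loss_g = F.binary_cross_entropy(discriminator(fake), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```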
This is another, I think, really insightful and useful GAN paper to read. They go through a couple of different problems with the typical GAN training procedure, looking at stability. The first thing they observe is that both the data and the generative distributions very likely lie on low-dimensional manifolds, and here's what that means.
Basically, this is a three-dimensional space and really a one-dimensional or two-dimensional manifold, and if these red and blue lines or planes are the generative and data distributions, they really don't intersect very much. They might be entirely disjoint, or there might be a very, very small region of space where they intersect. This paper talks about how that's really challenging for GAN training, because it means you can actually find a discriminator
that can perfectly distinguish between the generative distribution and the true data distribution. This means that if you have this kind of perfect discriminator, then the generator basically gets zero gradients everywhere, so it isn't super useful. This is an experiment, a plot from the same paper. So, as I said earlier,
a lot of the nice theoretical results for GANs come under the condition of an optimal discriminator, but in this paper they actually look at it and say: okay, if we actually have an optimal discriminator, that's really bad for the generator because of this vanishing gradient problem. What they did is basically continue to train the discriminator for increasing amounts of time and then look at the gradients the generator got for each of those discriminators.
Here, as we move along the x-axis the discriminator is getting better, and this is the norm of the generator's gradient, and we see that it actually decays — and this is, I believe, on a log scale. So this is problematic; it's this weird dilemma that GANs face where, you know,
if the discriminator is really bad, then it's not giving good feedback to the generator, but if the discriminator is really good, then the generator might not be getting enough signal from the discriminator. There are a lot of different things people have proposed over the years to deal with this. Instance noise is one example; this was proposed in a couple of different papers early on.
I actually don't think it's used that much now, but it's kind of an interesting historical note. [Audience question] Yeah, so in the traditional loss function here, because we have this log, this is going to saturate. If the discriminator is super certain — if it's really, really good, it's super confident, and it's doing a really good job of discriminating real samples from fake samples — it's just going to saturate because of this log here.
Sorry — the log function just plateaus at a certain point, so there's no more signal once you reach that point. [Audience question] Yeah, so I think what you're referring to is called label smoothing. In the traditional GAN framework you have labels of 0 and 1 for synthetic (fake) data and real data, and something that was proposed,
I think in the paper on improved techniques for training GANs from 2016, is label smoothing: basically, instead of giving hard 0/1 labels, you either give noisy labels or give labels of 0.9 and 0.1, so that you end up smoothing this distribution a little bit, and that does help a little bit in this instance.
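A tiny sketch of the label-smoothing variant just mentioned (targets of 0.9/0.1 instead of 1/0; the discriminator is assumed to output probabilities):

```python
import torch
import torch.nn.functional as F

def smoothed_discriminator_loss(d_real, d_fake):
    # d_real, d_fake: discriminator outputs in (0, 1) for real and generated batches
    real_targets = torch.full_like(d_real, 0.9)   # instead of 1.0
    fake_targets = torch.full_like(d_fake, 0.1)   # instead of 0.0
    return F.binary_cross_entropy(d_real, real_targets) + \
           F.binary_cross_entropy(d_fake, fake_targets)
```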
Another thing that very frequently happens with GANs is referred to as mode collapse. The idea is that at some point in training the generator just collapses and starts producing a very small set of images — it might be a single image, it might be a small set of images — but basically it's images that might look like total garbage when you look at them, that don't really reflect the data distribution, but that are
still fooling the discriminator. Sometimes you'll see this kind of cyclic behavior, where the generator will produce this set of images here, then the discriminator will catch up and start to be able to discriminate those, but then the generator will shift over to this other set here. This is a really common thing that happens.
This figure here is describing that. It comes from the unrolled GAN paper, which is a cool GAN architecture — it's not really used much that I know of now, but it's kind of a theoretically pleasing thing. I'll explain it in a second, but first: this is our target distribution, right.
You could imagine that the generator gets this one mode and it's fooling the discriminator really well, but then all of a sudden the discriminator catches on to how these samples relate to the training data, so the generator hops to another mode, and it just keeps moving around. This is a very toy example of that.
But you tend to see this a lot when you're training GANs. The unrolled GAN basically proposes that the generator actually take the discriminator's gradient into account — the "unrolling" refers to optimizing through both of the different objectives. Another thing that has been proposed is to use batch statistics; this was proposed in different ways in a couple of different papers.
The idea is that instead of the discriminator just seeing single instances, it actually gets to see a large set of instances at each time. You can implement this by just giving the discriminator access to all of the images in a batch, for example. Now the discriminator can see that there's a huge amount of diversity in, say, the 100 real images in the batch, but the 100 images coming from the generative distribution have very, very limited diversity.
Another challenge of GANs — and I'll get into this a little bit — is just the evaluation criterion. With likelihood-based methods there's a clear loss function that you're optimizing, and it goes down as you train, or hopefully it goes down, and if it doesn't go down then you know something is wrong. In contrast, most GAN frameworks are optimizing
with this alternating optimization procedure, where the optimum might be a saddle point, and the actual value of the loss function doesn't tell you that much. This can make it really tricky when you're trying to decide what a good stopping criterion is, when you want to do model comparison, or when you just want to run a giant hyperparameter sweep and pick your best model — this becomes really, really challenging, and you'll see a lot of papers in this area.
Okay, so, as I said, there are a ton of different GAN losses — this is not even half of them; it's just a nice table that I grabbed from this paper, and there are so many different versions. This is the original GAN objective that was proposed. The NS GAN is the non-saturating GAN, which is used very much in practice. I'm going to go through the WGAN, which is the Wasserstein GAN, and the WGAN-GP, which is the Wasserstein GAN with gradient penalty.
There are way more — this table should actually go on forever, because there are so many different variants that have been proposed. I'm going to go through the Wasserstein GAN because it's used quite frequently; I use it, I like it, it's fairly stable, and that's enough justification. Cool, okay. So, just really briefly, I'm going to give some intuition for the Wasserstein metric, which is what the Wasserstein GAN uses. I think I might have referenced this earlier, maybe not, anyway.
In this case we can think of our model distribution and our data distribution, and the cost of transporting mass between them. I'm just going to go through a really toy example, which will hopefully give some intuition. Basically, we have KL divergences as one way of measuring distances between distributions; this is another way of measuring distances.
Basically, in this toy example, imagine we have a bunch of boxes. The number of boxes in each particular location represents the probability mass in that location, and we want to convert this one distribution into another distribution. We're going to do this by moving these boxes over, and the cost of moving a box is the weight of the box — which we'll just assume is one for now — multiplied by the distance.
So if we were to move this box over here, the cost would be seven: ten minus three. Basically, the Wasserstein distance is the cost of the cheapest plan, where a plan specifies how we move each of these different boxes. I won't go into too much depth here, but if you go back and look at the slides, this blog post goes through it in more depth.
Basically, this plan here describes where each of the boxes is moving, and these distributions are going to be the marginal distributions. So this is saying — actually, these distributions should be flipped, sorry, this is bad — but here we have three boxes here, and then we're moving them over to these different locations here. So, right, we may have a whole bunch of different plans.
Each of them will have a different cost, and the Wasserstein distance is the cost of the cheapest of these transport plans. In this particular example they both cost the same, but it's a simple example of how different plans can cost different things. Okay, so writing this out mathematically: we have our data distribution and our generator distribution, these gammas are the different transport plans, and this altogether is the Wasserstein metric, also called the earth mover's metric.
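Written out, the quantity being described is the following, with Π(p_data, p_g) the set of joint distributions (transport plans) whose marginals are the data and generator distributions:

```latex
W(p_{\text{data}}, p_g) \;=\; \inf_{\gamma \in \Pi(p_{\text{data}},\, p_g)} \; \mathbb{E}_{(x, y) \sim \gamma}\big[\, \lVert x - y \rVert \,\big]
```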
This paper — I would recommend reading it — goes into a lot of really nice theory as to why this type of metric makes a lot more sense when you're comparing these kinds of probability distributions. A lot of the intuition they go through is that you end up with no gradients in certain places if you use the KL divergence, reverse KL, or Jensen-Shannon, but you end up with gradients everywhere if you use this, and they work through a couple of really simple toy examples.
Of course, this is intractable, so of course they come up with a nice approximation. The approximation they propose in the paper uses this duality, which I honestly don't know much about, so I'm just going to point to this blog post. I haven't really seen it explained much in most GAN papers — it's just kind of stated — so I found this blog post and I would encourage you to look at it, because I can't teach that. So cool.
So now, because of this duality, we have rewritten the Wasserstein loss in a form which looks very, very similar to our GAN setup from before. We have this expectation over data sampled from our data distribution and this expectation over generated images. Here we have this f function, which in the Wasserstein framework is typically called a critic, but if we just sub in D here, then we basically have
something that looks a lot like the GAN objective. Just to compare these side by side: this is the original GAN objective, and this is the Wasserstein GAN objective. Here we now have a constrained optimization problem, because we need this discriminator function to be 1-Lipschitz, and in the Wasserstein GAN paper they originally enforced this through weight clipping.
That's not the greatest way of doing it. They did get some good results, and I think when that paper came out it was state of the art, but there are different ways of enforcing this constraint. So this is the Wasserstein GAN with the gradient penalty — this formulation is much more commonly used — and basically the idea is just to enforce the constraint by adding a gradient penalty that constrains the discriminator to have gradient norms of at most one everywhere. Cool.
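Here's a rough sketch of the gradient penalty term as it's usually implemented (penalizing the critic's gradient norm at random interpolates between real and generated samples); this is my sketch of the standard recipe, not code from the talk:

```python
import torch

def gradient_penalty(critic, real, fake, weight=10.0):
    # Random points on the lines between real and generated samples.
    alpha = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)

    scores = critic(interp)
    grads = torch.autograd.grad(scores.sum(), interp, create_graph=True)[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return weight * ((grad_norm - 1.0) ** 2).mean()  # pushes the critic toward unit gradient norm
```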
Now, with this formulation, we get some really nice samples. These are a bunch of synthesized bedroom images. I always feel like people who jump into generative models with no context ask: why are you generating bedrooms? But bedrooms are just a really common data set that people use in the GAN world, so that's why we're looking at bedrooms. These are all synthetic images, but they're all pretty high quality, and again this was kind of state of the art when it came out. Okay, so another set of GAN stabilization techniques.
So far we've looked at a couple of different loss functions; now I'm just going to stick in some different architectural improvements. These are just things to think about when you're building a GAN model. Batch normalization, and the idea of avoiding sparse gradients: both of these ideas were initially proposed in the DCGAN paper that came out in 2015. This is just an example of what that architecture looks like — really simple tricks that just really improve training. Basically, batch norm works by normalizing
A
the input features to a layer to have zero mean and unit variance, and it just really helps with stability. It also empirically helped a little bit with mode collapse, and I think there was some intuition that it helped deal with poor parameter initialization in some cases, because batch norm is generally helpful there. Then virtual batch norm is a different variant, where each example is normalized
A
based on statistics collected from a reference batch, as opposed to batch norm, where the statistics are collected from the particular batch that's going through the network. So yes, this was another good trick. Then, generally, it's good to avoid sparse gradients: instead of using ReLUs, use leaky ReLUs, and this just helps improve the signal from the discriminator to the generator. Spectral normalization: this is a more recent paper.
A
It's a weight normalization method that has also been proposed to stabilize training, and it can be thought of as an alternative to the weight clipping or gradient penalties; another normalization technique. It was originally applied just in the discriminator. These synthesized images are coming from that paper, so now we're really getting into state-of-the-art image generation. These are synthetic pizzas and synthetic cats, because everybody loves cats and pizza.
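A minimal sketch of those two tricks together, assuming PyTorch: spectral normalization on the discriminator's weights and leaky ReLUs instead of ReLUs; the layer sizes here are made up:

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

discriminator = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.2),   # keeps a gradient for negative activations
    spectral_norm(nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Conv2d(128, 1, kernel_size=4)),  # scalar real-vs-fake score
)
```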
A
So these are good examples to show. Later work then applied it to both the discriminator and the generator and found it was helpful there too. This is also a cool paper that I think is worth reading: basically, they looked at the conditioning of the Jacobian of the generator and found empirically that poor conditioning of the generator was very predictive of the different metrics that people use to evaluate generative models. Okay, multi-scale architectures. Now we're going to get into some architectural stuff.
A
Progressive growing of GANs. This is another good paper; I'd say it's also still roughly state of the art. The idea here is that we start with a low-resolution image and then progressively increase the resolution, but this time by adding layers to the network in an incremental fashion, as opposed to the multi-scale approach where you train one model at this resolution, then fix it and train another model
Another
resolution
on
here
the
layers
of
the
network
right
is
kind
of
progressively
being
added
in,
and
this
is
nice
because
it
allows
the
models
work
early
on
to
capture
very
coarse-grained
features
of
your
data
set
and
then
over
time
kind
of
focus
on
more
fine-grained
details.
These
are
some
samples
from
the
progressive
growing
up
Ganz
paper.
These
are
all
synthesized
phases.
It
was
trained
on
a
data
set
of
celebrities,
so
they
all
look
like
celebrities.
I,
don't
know
these
are
1024
by
1024
pixels.
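A minimal sketch of the fade-in idea behind this kind of progressive growing, assuming PyTorch and that the newly added block upsamples by a factor of two; all names here are illustrative:

```python
import torch.nn.functional as F

def grow_step(features, old_to_rgb, new_block, new_to_rgb, alpha):
    # Old path: image from the previous resolution, naively upsampled.
    low = F.interpolate(old_to_rgb(features), scale_factor=2)
    # New path: image produced through the freshly added (upsampling) layers.
    high = new_to_rgb(new_block(features))
    # alpha ramps from 0 to 1, smoothly handing off from old to new layers.
    return alpha * high + (1 - alpha) * low
```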
A
That's quite a high resolution. So, long-range dependencies and global structure still remained a challenge, though this is getting better. The self-attention GAN was proposed as a way of dealing with this. At the point when it was proposed, GANs were really good at low-level textures, and they were also really good at faces, which have a lot of symmetry and a lot of structure, but for generic objects
A
GANs would have trouble with things like counting and would just put, say, six eyes on a face, so you'd see images that from far away look like an image but close up, not really. Basically, the self-attention GAN introduced an attention mechanism into the generator and the discriminator, and this helped the model learn long-range dependencies and more global structure. So now, with the self-attention GAN, we see really coherent global structure.
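A minimal sketch of an image self-attention block in the spirit of what's being described, assuming PyTorch; the class name and channel-reduction factor are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # starts as an identity mapping

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).view(b, -1, h * w)                       # (b, c//8, hw)
        k = self.key(x).view(b, -1, h * w)                         # (b, c//8, hw)
        v = self.value(x).view(b, -1, h * w)                       # (b, c,    hw)
        attn = F.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)  # (b, hw, hw)
        out = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)
        # Every spatial location can attend to every other location.
        return x + self.gamma * out
```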
A
These are samples of dogs and different kinds of birds and fish, with just much more coherent structure. Okay, BigGAN. This is another good one, another paper I'd recommend reading, because this is the state of the art right now, and these are all synthetic images and they're really good.
A
What's cool about this paper is that they just did a bunch of really simple things. They did things like increasing the batch size: here's the batch size, they just increased it, and already that improved things. These are different metrics, which I'll get into in a second; smaller is better here, larger is better here. They increased the batch size, and they doubled the number of channels in each layer.
A
They added skip connections from the noise vector, so instead of the noise vector going directly into only the first layer of the generator, it goes into each layer, and they had a couple of different versions here: they would either give the exact same noise to each layer, or each layer
A
got a different chunk of the noise vector. They also saw really good results by sampling from a truncated Gaussian distribution, and the idea here is that when values fall outside a particular range, you just re-sample within that range. This didn't work with every architecture, because you're essentially sampling from a different distribution than you saw during training, but with some other tricks they made it work. So here now we see, okay, this is the effect of that truncation.
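A minimal sketch of that truncated sampling, assuming NumPy/SciPy; the threshold value is illustrative and plays the role of the truncation level:

```python
import numpy as np
from scipy.stats import truncnorm

def truncated_z(batch_size, dim, threshold=0.5):
    # Draw latent vectors from a standard normal truncated to
    # [-threshold, threshold]; out-of-range values are effectively re-drawn.
    z = truncnorm.rvs(-threshold, threshold, size=(batch_size, dim))
    return z.astype(np.float32)
```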
A
Basically, this is similar to that Glow model I was showing earlier, where you had a temperature parameter to control sampling, so you can generate very regular images or get more diversity. You can see that the quality of the images improves as you limit the diversity, but you can pick something in the middle here and it looks pretty good. And then, yeah, these are just some things that BigGAN still struggles with, on the left.
A
But this is another method, where they basically take techniques from style transfer models and bring those into the generator. They're building off of the progressive GAN model but just improving the generator, and now we're starting to get frankly scary image generation of people. These are all synthetic faces that came from this StyleGAN model, and the model also does a good job on other kinds of objects.
A
These are bedrooms again, and cars, and we're seeing good, globally coherent structure. Cool. I'm actually going to rush through this; I'm just going to point to this paper, because it's again some nice theory, the f-GAN paper. f-divergences are a class of divergences, again for comparing different probability distributions, and they show some nice theory that unifies a lot of different GAN frameworks and other kinds of implicit models within this. And, yes, just read that paper. Moment matching networks.
A
This is another framework that I'm just going to skip through. It's similar to GANs in terms of the actual architecture of the model, the ancestral sampling procedure, all of that kind of stuff, but training is more stable because you don't have this alternating optimization procedure. They haven't really caught on, though, and the samples are not amazing, although there is some work where people combine GANs with moment matching networks, and you can also use the moment matching criterion as a way of evaluating GANs.
A
So if you're interested, go read this further; I'm going to skip through here. Okay, so evaluating generative models. This is something I kind of skimmed over throughout. The obvious way of evaluating generative models is to look at the log-likelihood: take the likelihood of your data set, typically held-out data that you didn't see during training, and look at the likelihood of that data under the model that you've learned.
A
This is the natural way in which likelihood-based methods are compared with one another, but it isn't really viable for implicit models like GANs, because there's no explicit density function. Early works tried to get around this with a kind of Parzen-window approach, where you take a bunch of sampled images, place a Gaussian on top of each one, and say that this mixture is your density model, or an approximation to it.
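A minimal sketch of that kind of Parzen-window estimate, assuming NumPy/SciPy; the bandwidth `sigma` is an illustrative choice, and, as noted next, this breaks down in high dimensions:

```python
import numpy as np
from scipy.special import logsumexp

def parzen_log_likelihood(test_x, samples, sigma=0.1):
    # test_x: (n, d) held-out points; samples: (m, d) generated points.
    diff = test_x[:, None, :] - samples[None, :, :]               # (n, m, d)
    log_kernel = -0.5 * np.sum(diff ** 2, axis=-1) / sigma ** 2   # (n, m)
    d = test_x.shape[1]
    log_norm = -0.5 * d * np.log(2 * np.pi * sigma ** 2)
    # log of the average of m Gaussians centered on the samples.
    return logsumexp(log_kernel + log_norm, axis=1) - np.log(samples.shape[0])
```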
A
But this doesn't really work, because in high-dimensional spaces it just doesn't make sense. Other things you might care about are the perceptual quality of the generations and their diversity. Whether the model is overfitting is really important, because if you just take your training set, it has a lot of diversity and the perceptual quality is really good, but that's not a great generative model, so overfitting is really important to check. And then also, you know, utility for some kind of downstream task.
A
So this means that, given a huge set of generated images, if I look at the marginal distribution over the labels for all of them, I should hit all of the classes roughly equally. The inception score looks at the KL divergence between these two distributions, because you want one, the conditional label distribution, to be highly peaked while wanting the other, the marginal, to be uniform. So if we look at the KL divergence, the higher the KL, the more we're satisfying these two things. Cool.
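A minimal sketch of that computation, assuming `probs` holds the classifier's predicted label distributions p(y|x) for a set of generated images (e.g. from an Inception network):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    # probs: (n_samples, n_classes) array of p(y|x) per generated image.
    p_y = probs.mean(axis=0, keepdims=True)                  # marginal label distribution
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))   # KL(p(y|x) || p(y)) terms
    return float(np.exp(kl.sum(axis=1).mean()))              # higher is better
```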
A
This is nice because it gives us one single number, but it does have some limitations. It doesn't capture within-class diversity: if you generated just a single instance of each class, you'd still get a really high score. Another limitation is that it doesn't actually rely at all on the data distribution; nowhere am I using the data distribution when I compute this, although you could imagine training the classifier network that you use on your data distribution
A
if you had labels for it, and so you could kind of get it in there, but it's a bit weird that it doesn't use it. And then there's also no measure of overfitting: you could just reproduce instances from your training set exactly and you'd still have a high inception score.
A
So an additional thing we need to do is something like looking at nearest neighbors. This is frequently done to measure overfitting: nearest neighbors either in pixel space, which is something, although nearest neighbors in pixel space don't make a lot of sense just because we're in this very high-dimensional space, or in the embedding space of some other model, a classifier, for example.
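A minimal sketch of that embedding-space nearest-neighbour check, assuming the generated and training images have already been mapped to feature vectors by some fixed model:

```python
import numpy as np

def nearest_training_neighbors(generated_feats, train_feats):
    # generated_feats: (g, d), train_feats: (t, d) feature vectors.
    dists = np.linalg.norm(
        generated_feats[:, None, :] - train_feats[None, :, :], axis=-1)
    nearest = dists.argmin(axis=1)  # closest training image for each sample
    # Returning the indices and distances lets you eyeball near-duplicates.
    return nearest, dists[np.arange(len(nearest)), nearest]
```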
A
Also, human evaluations are frequently used, where you get a batch of humans, have them look at the images, and ask which one is perceptually better. These are some good further readings on this space of evaluating models. This one is actually a good one to read, "Are GANs Created Equal?": they basically implemented a huge number of the different GAN training frameworks and found, essentially, that
A
a lot of the differences actually just come down to little optimization details; it's a really good summary. And then also "Pros and Cons of GAN Evaluation Measures": they go through a huge set of different evaluation approaches. I have about two minutes left, so I'm going to go through some cool applications. Image-to-image translation.
A
This work basically translates between different domains, so we can go from, say, daytime to nighttime, or from a sketch to an actual image of a purse. Unpaired image-to-image translation, this is cool: the previous work was paired, meaning that during training we needed paired examples, whereas this one just needs two different domains, and that's cool.
A
Super-resolution, I think I mentioned this one earlier on, but this is a nice kind of image-editing application. Generating molecules: I've been focusing a lot on images throughout this talk, but there are a whole lot of other domains in which you can apply these different techniques. And yeah, there we go.