National Energy Research Scientific Computing Center (NERSC) Deep Learning for Science School 2019, 7 Aug 2019

From YouTube: 18 - Featurewise Transformations - Vincent Dumoulin

Description

Deep Learning for Science School 2019 - Lawrence Berkeley National Lab
Agenda and talk slides are available at: https://dl4sci-school.lbl.gov/agenda

A

Welcome, everyone. We're going to get started. Our first speaker today is Vincent. Vincent completed his PhD in computer science at the University of Montreal under the supervision of Yoshua Bengio and Aaron Courville, working on various forms of generative models. He is currently a research scientist at Google Brain in Montreal, and his current research interests include generative modeling and meta-learning. Please help me welcome Vincent.

B

Right, thanks, everyone. I hope you're having a nice summer school week and you're learning tons of things. Before we get started, a little icebreaker: a fun fact about myself. During my first year as an undergrad, I interned in the solar physics lab at the University of Montreal, so I do have some scientific experience. I worked on a numerical simulation for spectral solar irradiance in the near and middle ultraviolet, and I learned two things: one, a few things about evolutionary algorithms, and two, that Fortran as a language is still very much alive today.

B

So this lecture is about feature-wise transformations and, as you'll get to see, they're very simple and effective in modulating computation in a neural network. But before we start: I'd like this lecture to be interactive, so if you have any questions, please don't hesitate to interrupt and ask them.

B

I do have a fair amount of material, so I might stop taking questions at some point, but rest assured I will be available during the coffee break for your questions, if you have them. Okay, so this lecture is inspired by a Distill article, published about a year ago, that shares the same name. If you find the presentation or the article useful and influential to your research, you can always cite it using the BibTeX entry displayed on this slide.

B

At the end of the article itself, there are also instructions on how to cite it.

B

So it's not straightforward for me to give you one use case justifying exactly why you'd want to use feature-wise transformations because, as you'll see, there are many ways in which you can consider what feature-wise transformations are and many reasons why you'd want to use them. So having a clear and crisp use case for them is a bit difficult for me. In truth, it's closer to an architectural feature that is used for various reasons and in various problem settings.

B

You'll see that feature-wise transformations are found in a surprisingly varied number of recent approaches, spanning many research areas, and we can reason about them from many perspectives: for instance, the conditional computation perspective, the multitask learning perspective, the zero-shot learning perspective, or the modality fusion perspective. So it's a good tool to have in your toolbox, and one of my goals in this lecture is to get you to think about and recognize application opportunities for feature-wise transformations.

B

Another objective I have is to get you to think about learning problems that have a more complex structure than the usual input-output structure found in, for instance, supervised learning, and to think about those problems from different perspectives, such that you can approach your own learning problems differently.

B

The perspectives I will present are all valid, but they don't necessarily suggest the same inductive biases or, in other words, the same architectural features, so I think it's a good ability to have, to be able to navigate freely between these perspectives. Before I jump in with the definitions, let's think about some use cases.

B

So one example I like to give when I introduce feature-wise transformations is the example of class-conditional generative modeling. The best image generation models that we have today are class-conditional, meaning that there is a distinct generation pipeline for each class in the data set.

B

In this example, we have three classes: cat, dog, and airplane. So we have a collection of cat pictures, a collection of dog pictures, and a collection of airplane pictures, and then we train our class-conditional generative model by training separate generators: one for cats trained exclusively on cat images, one for dogs, and so on. To sample from the model conditioned on a class, we have some noise input, which we just need to feed to the right generator.

B

So if we want to sample cat pictures, we route the noise through the cat generator and then we get a cat picture as output. Okay, this is all good, but this is not how state-of-the-art models are built nowadays: we don't have separate generation pipelines for each class in the data set. Instead, we have a single generation pipeline that is supplied with a class indicator, in this case cat, dog, or airplane, and that class indicator should somehow influence the way in which the generator processes the noise input and produces an output.

B

As a side note, there is nothing preventing us from packaging these three generators into one class-conditional generator, which trivially routes the noise to the right sub-generator for conditioning. It's also one of the many instances in this lecture where I'll tell you that, if you squint your eyes hard enough, concept A really is a special case of concept B. We'll see more examples of that later.

B

But it's questionable whether we gain anything by doing this, because the three generators don't share anything. So what have we gained, really? I see two immediate problems with that. One is scaling. Here we have three classes, so we have three separate instances of the network architecture to train and store. Maybe that's fine, but what if we scale up to ImageNet, where we have a thousand classes? Do you really want to train and store one thousand separate copies of a network architecture? Maybe not. Another problem is the lack of positive transfer.

B

What I mean by this is that, for instance, in ImageNet there are many dog breeds, so many classes of dogs, and if we were to train separate generators on each of them, we would have to learn to generate fur texture in all cases. So it seems kind of wasteful to be relearning that over and over again, and we'd also be losing out on tons of useful examples by keeping things siloed.

B

So it's still an ongoing research question how to construct model architectures that scale well and favor positive transfer, but as we'll see later, feature-wise transformations represent one kind of approach that is both simple and effective in that respect.

B

Already here, I think we can see different perspectives emerge, so let's look at three of them. One is that of modality fusion. We can think of modality fusion as having multiple modalities, in this case a noise input and a class label, and what we want to do is fuse them somehow in such a way that we produce the right output. So we're solving a single task for which both the noise and the class label are inputs.

B

From the multitask learning perspective, we tackle multiple tasks, one per class, in parallel and in a parameter-efficient manner through parameter sharing. So we have one source of noise, which can serve many different tasks, and then we want to produce the right output.

B

From the conditioning perspective, we are processing this noise input in the context of the class label, so the class label here is treated as a side-information channel. There's an interesting asymmetry here, which is that, unlike with the modality fusion example, where we didn't have any preferred modality... Oh, a question? (Audience question about the real size of the objects.) So your question is: is it considering the actual size of the objects it is generating?

B

That's a good question. There's nothing that is explicitly enforcing that. It's only through observation that the generator learns to do that. So if, statistically, things have a certain scale with respect to each other, that's what the generator will learn, but there's nothing explicitly enforcing that to happen. So, back to the notion of asymmetry.

B

We have a preferred modality here, which is the noise input, and then the label, as I said, is treated as a side-information channel, and it's used to modulate the computation, as I was saying earlier.

B

Okay, let's switch over to a new example: visual reasoning. Very briefly, in visual reasoning we have an image as input, and we want to ask questions about that image: any questions that probe the model's ability to reason about the content of the image and the relationships between objects in it.

B

We can also think about this problem from the three different learning perspectives I just introduced. From the modality fusion perspective, the image modality and the text modality should be fused together in such a way that we get the right answer at the output.

B

The conditioning perspective says that we're processing the input image in the context of the question being asked.

B

What I mean by this is that there's one image as input, but we could ask many different questions about it, and our goal is to extract the right kind of information, so the image needs to be processed in the context of that question. Note here that we're not limited to a finite set of contexts, like the finite set of classes in our class-conditional generative modeling example, because for any input image, there's an unbounded number of questions we could ask about that image.

B

From the multitask learning perspective, one question amounts to one computational process. What I mean by this is that a question we could ask could apply to any input image, assuming, of course, that the question makes sense for that input image. So, in a sense, a question can be thought of as a task description. For example, the question "How many green cubes are there?" corresponds to the task of identifying and counting green cubes in an image.

B

A more general note here about visual reasoning is that it's not enough to do well on the tasks that we've seen, so the questions in the training set, because to be useful, a visual reasoning model should be able to generalize to new questions, and there is no training data for those questions: we don't have paired inputs and outputs. There's a name for that problem: it's called zero-shot learning, and the reason why it could ever work is because we're exploiting similarities between tasks.

B

In this example, the question "How many green cubes are there?" is similar to the questions "How many blue cubes are there?" and "How many green spheres are there?", because it shares the shape with the first question and the color with the second.

B

So if the model is able to answer these two questions, then perhaps it's in a good position to answer that question here. But to build a useful notion of task similarity, we need to build task representations, and a task representation allows us to project those task descriptions onto a space in which distances are indicative of similarity between tasks. So this is a form of task representation learning, as I would call it.

B

Sometimes learning the representation for the task is a separate process, but sometimes it is learned jointly, and for most of the examples in this lecture, as we'll see, it is learned jointly with the rest of the problem.

B

Okay, so to summarize, I've introduced a few different, complementary perspectives on some learning problems, and the common denominator for these perspectives is a requirement to combine different information sources that goes beyond the usual input-output data processing paradigm of, for instance, supervised learning. Next we'll examine one family of approaches to combining these different information sources, which I call feature-wise transformations, and we'll examine how these different learning perspectives fit into the framework of feature-wise transformations.

B

Okay, so what's a feature-wise transformation?

B

In short, it's a transformation on an input feature vector or a stack of feature maps that acts independently on individual features or feature maps. I'll get to the feature map and convolutional case in just the next slide, but for now let's concentrate on the vector case. What I mean by that is that we have an input, which is a vector, and then, for instance, we could bias it. So we have a biasing vector, and then the operation is applied element-wise on each feature.

B

We're not recombining those features, we're not taking weighted sums of those features; we're, for instance, biasing them individually or scaling them individually. We could also be gating them: if the scaling is restricted to be between 0 and 1, we have a sort of soft on/off kind of semantics, which could be achieved by passing our scaling vector through a sigmoidal activation function, for instance. And we can generalize all three of these examples into a notion of feature-wise affine transformation.

B

In a visual reasoning paper I co-authored, we proposed the name FiLM, for feature-wise linear modulation.

B

For those of you who are a bit more math oriented, I know that technically a linear transformation is not the same as an affine transformation, but we couldn't resist having a nice acronym, so please excuse us for that.
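As a minimal sketch of the vector case, the plain-Python snippet below applies a feature-wise affine transformation and recovers biasing, scaling, and gating as special cases. The function name `film`, the variable names, and the example values are illustrative assumptions, not something taken from the paper.

```python
import math


def film(features, gamma, beta):
    """Scale and shift each feature independently (no mixing across features)."""
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


x = [1.0, -2.0, 0.5]
zeros = [0.0, 0.0, 0.0]
ones = [1.0, 1.0, 1.0]

# Biasing: the scaling is fixed to 1, only the shift varies.
biased = film(x, ones, [0.5, 0.0, -1.0])

# Scaling: the shift is fixed to 0, only the scale varies.
scaled = film(x, [2.0, 0.0, -1.0], zeros)

# Gating: the scale is squashed into (0, 1) by a sigmoid, giving soft on/off behavior.
gates = [sigmoid(z) for z in [10.0, -10.0, 0.0]]
gated = film(x, gates, zeros)

# General feature-wise affine transformation: both coefficients are free.
modulated = film(x, [2.0, 0.5, 1.0], [0.1, 0.2, 0.3])
```

Each output feature depends only on the input feature at the same position, which is what distinguishes this from a fully connected layer.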

B

Okay, and feature-wise transformations are used quite a lot, as you'll see, so it's kind of useful to have a catch-all term to reason about them: something that is either scaling, shifting, gating, anything of that nature. So in this lecture we'll use the FiLM nomenclature, and the reason for that is because I think it's a useful communication tool. As I said, it's a catch-all term.

B

It allows us to reason more abstractly about these concepts, but please don't take this as me claiming technical innovation on all of the methods that I'll be discussing, because many papers out there, some of which predate the FiLM nomenclature, make use of feature-wise transformations. So it's more of an observation on the fact that these sorts of approaches are pretty ubiquitous. Depending on the restrictions that we impose on these FiLM parameters, so the scaling and shifting coefficients, we recover different flavors.

B

If there are no restrictions, we recover just FiLM. If the scaling is forced to be one, we recover biasing. If the biasing is forced to be zero, we recover scaling. And then, as I said earlier, if we restrict the scaling to be between 0 and 1, we get gating. So here's an example animation showing how this works. We have an input vector, and then we can control the value of the scaling and shifting coefficients, and you see that they only affect one feature here.

B

So this is why we call these feature-wise transformations. Okay, so in the convolutional case that I brushed aside earlier, the way this works is that we still have one scaling and one shifting scalar, but now per feature map. The reason for that, in short, is because you can think of the convolution operation as convolving a feature detector over different spatial positions. So, in essence, the feature map represents the same feature, but evaluated at different spatial positions.

B

So from that perspective, it sort of makes sense that we would want to scale and shift entire feature maps rather than individual spatial positions. But there are papers out there that untie this scaling and shifting across spatial positions, and there are situations in which that makes sense. So it's more of a rule that can be broken than a statement about all of these methods.
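The convolutional case can be sketched the same way. In this hypothetical snippet, a stack of feature maps is a nested list indexed as `[channel][row][column]`, and a single scaling and shifting coefficient is shared by every spatial position of a feature map. The name `film_conv` and the values are assumptions made for illustration.

```python
def film_conv(feature_maps, gamma, beta):
    """Apply one (gamma, beta) pair per feature map, shared across spatial positions."""
    return [
        [[g * value + b for value in row] for row in fmap]
        for fmap, g, b in zip(feature_maps, gamma, beta)
    ]


maps = [
    [[1.0, 2.0], [3.0, 4.0]],    # feature map 0
    [[0.0, -1.0], [1.0, 0.5]],   # feature map 1
]

# Feature map 0 is doubled everywhere; feature map 1 is negated and shifted by 1.
out = film_conv(maps, gamma=[2.0, -1.0], beta=[0.0, 1.0])
```

Untying the coefficients across spatial positions, as some papers do, would amount to replacing each scalar with a full map of coefficients.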

B

All right, so we introduce feature-wise transformations to the model by inserting FiLM layers, so these things that I showed in just the last slide, into an existing network architecture. Everything is trained jointly; we just update the network architecture by inserting these FiLM layers, and we'll call the scaling and shifting coefficients the FiLM parameters in our nomenclature. To reiterate, FiLM layers here are an abstraction, so they can mean any of feature-wise affine, biasing, scaling, or gating transformations.

B

Let's make this concrete by revisiting the class-conditional generative modeling example. The feature-wise transformation approach to building a class-conditional model here would be to start with an unconditional model and then turn it into a class-conditional model. Just as a reminder of why we want to do this: if we were to learn separate generators for each class in the data set, we would have an explosion in the number of trainable parameters, and we want to avoid that.

B

So, as I said earlier, we take a base generator architecture, we insert FiLM layers throughout the architecture, and then we learn separate sets of FiLM parameters for each class. So we have a cat set of FiLM parameters.

B

We have a dog set of FiLM parameters and an airplane set of FiLM parameters, and then the way in which we condition the model is that we have our noise as input and, if I condition on, say, the cat category, I will take the cat FiLM parameters, use them inside of the network, and then feed in the noise and get a cat as output.

B

Okay, let's squint our eyes again. There are different ways of explaining what we're doing here. One interpretation is that this is a fancy and compact way of describing several class-specific generators which just happen to share most of their parameters. In other words, we have different generators here that are not explicitly represented, but we can think of them implicitly: they share all of these parameters here, and they specialize these sets of parameters in the network.

B

So what's really going on when we're conditioning is that we're implicitly swapping out class-specific generators. That's one interpretation. Another interpretation is that this is a special kind of hypernetwork. For those of you who don't know what a hypernetwork is, it's a network which predicts the parameters for another network. So what's really going on here is that we're taking the cat class, we're feeding it through our hypernetwork, which will predict the value of the FiLM parameters, and then we can feed in the noise and get the output as a result.

B

So let's make this even clearer by revisiting the visual reasoning example.

B

So one way to use feature-wise transformations for this is to start with a convolutional network. We're still trying to map an input to, perhaps, a distribution over possible answers and, like with the class-conditional generator, we'll insert FiLM layers throughout the architecture. But here we won't be learning a separate set of FiLM parameters per class because, as a reminder, there's an unbounded number of questions we can ask.

B

So we can't really learn a separate set of FiLM parameters for each. Instead, we'll learn a mapping from the question itself to the value of the FiLM parameters. Here I could use, for example, a recurrent neural network to map the question to the value of the FiLM parameters. And just as a side note, this is far from the only approach to visual reasoning out there.

B

There are some approaches that incorporate additional inductive biases, such as a notion of modularity in the visual pipeline, like wanting to compose explicit blocks of computation, or they can also incorporate a notion of relation between objects. So this is not the only way to solve the problem, but it is a way which doesn't have a lot of inductive bias associated with it.
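To illustrate the mapping from a question to FiLM parameters mentioned above, here is a rough sketch of a tiny recurrent network in plain Python. Everything in it is an assumption made for illustration: the vocabulary, the sizes, the simple tanh recurrence, and the random weights, which stand in for weights that would be learned jointly with the rest of the model.

```python
import math
import random

rng = random.Random(0)
VOCAB = {"how": 0, "many": 1, "green": 2, "blue": 3,
         "cubes": 4, "spheres": 5, "are": 6, "there": 7}
HIDDEN, NUM_FEATURES = 5, 3


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Random stand-ins for learned parameters.
embeddings = rand_matrix(len(VOCAB), HIDDEN)      # one embedding per word
w_recurrent = rand_matrix(HIDDEN, HIDDEN)         # hidden-to-hidden weights
w_output = rand_matrix(2 * NUM_FEATURES, HIDDEN)  # hidden state -> FiLM parameters


def film_generator(question):
    """Read the question word by word and predict (gamma, beta) from the final state."""
    hidden = [0.0] * HIDDEN
    for word in question.lower().split():
        recurrent = matvec(w_recurrent, hidden)
        hidden = [math.tanh(r + e) for r, e in zip(recurrent, embeddings[VOCAB[word]])]
    params = matvec(w_output, hidden)
    return params[:NUM_FEATURES], params[NUM_FEATURES:]


gamma_cubes, beta_cubes = film_generator("how many green cubes are there")
gamma_spheres, beta_spheres = film_generator("how many blue spheres are there")
```

Because the parameters are predicted from the words rather than looked up per question, any question over this vocabulary yields FiLM parameters, including ones never seen during training.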

B

Okay, let's move away from problem-specific details and consider an abstraction that fits both the class-conditional generative modeling problem and the visual reasoning problem. In both cases we have two components to the model architecture. We have a task-solving network being modulated, which we'll call the FiLM-ed network, and we have an auxiliary network which maps a task description to a set of modulation parameters, which we'll call the FiLM generator. The FiLM generator can be really simple or complex.

B

In the class-conditional generative modeling example, we saw that it was just selecting parameters. You can think of this as building a big matrix of FiLM parameters: the rows represent different classes and the columns represent different scaling and shifting coefficients for different features or feature maps, so the FiLM generator really is just selecting a row in that matrix and using the parameter values inside of the network.
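The row-selection view described above can be sketched as follows. The table layout (scaling coefficients first, then shifting coefficients), the class names, and the random initial values are assumptions for illustration; in a real model the table entries would be trained jointly with the shared generator weights.

```python
import random

rng = random.Random(0)
CLASSES = ["cat", "dog", "airplane"]
NUM_FEATURES = 4

# One row per class: NUM_FEATURES scaling coefficients followed by
# NUM_FEATURES shifting coefficients (random stand-ins for learned values).
film_table = [
    [1.0 + 0.1 * rng.gauss(0.0, 1.0) for _ in range(NUM_FEATURES)]
    + [0.1 * rng.gauss(0.0, 1.0) for _ in range(NUM_FEATURES)]
    for _ in CLASSES
]


def film_generator(class_name):
    """Select the row of FiLM parameters that belongs to the requested class."""
    row = film_table[CLASSES.index(class_name)]
    return row[:NUM_FEATURES], row[NUM_FEATURES:]


def film_layer(hidden, gamma, beta):
    return [g * h + b for h, g, b in zip(hidden, gamma, beta)]


# Stand-in for features that the shared generator weights computed from the noise.
hidden = [0.3, -1.2, 0.8, 2.0]

gamma, beta = film_generator("cat")
cat_features = film_layer(hidden, gamma, beta)
```

Adding a class only adds one row to the table, which is where the parameter savings over separate per-class generators come from.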

B

The generator can also be more complicated, such as in the visual reasoning example, where we have an explicit mapping from a question to parameter values. Okay, so if you're still thinking about the different learning perspectives I introduced earlier, you may have recognized that this is leaning more towards the conditioning perspective than the modality fusion perspective, because there's a clear asymmetry in how we handle the two modalities, and this suggests a different kind of inductive bias.

B

One modality is influencing how another one is being processed, as opposed to the two modalities being processed in parallel for some time and then aggregated for further processing. As we continue in the lecture, I'll become less and less perspective-agnostic and rely more heavily on the conditioning perspective, and the reason for that is because I think there are interesting observations and insights to explore here. But keep in mind that there is merit to all perspectives, and sometimes the modality fusion perspective is more suitable.

B

For instance, what happens if we have three modalities which we want to combine together? How would we use the FiLM framework for that? It's not clear. Or what happens if I have two modalities, for instance audio and video in a video clip, which don't have a clear conditioning relationship? Once again, it's a bit harder to think about this from the conditioning perspective.

B

Okay, at this point, I think it would be useful to go over a few example applications. The goal here is to help you recognize applications of feature-wise transformations in the literature and, as you'll see, they come with many different names, so it's good to be able to connect them back to a more general framework. I also want to get you to draw parallels with your own problems.

B

Perhaps you'll recognize something useful for you. As such, I'll be focusing on showing different aspects of feature-wise transformations, so I won't necessarily be thorough in citing all of the relevant literature, and I won't be doing justice to the full history of applications of feature-wise transformations. In the Distill article we also have this sort of recency bias, but we do have bibliographic notes that attempt to be a bit more thorough in pointing to older, related work. So if you're curious about that, I would encourage you to check out the Distill article.

B

Okay, let's start with a straightforward one. This is the visual reasoning example I gave earlier, and this is the paper in which we proposed the FiLM nomenclature, so there's not much else to say about this. In this specific case we're inserting FiLM layers in the residual blocks of the network but, aside from that, I think you get the picture.

B

Feature-wise transformations have also been applied to style transfer, and we'll see three examples of that. A small primer on style transfer, if you're not familiar: some approaches to style transfer train a feed-forward network that is specialized to a single style image. The idea here is: I have a style image, I have a content image, and I want to produce a pastiche, a stylized version of the content image in the style of the style image. How do I do that?

B

Well, I fix the style image, I specialize my network to that style image, and then I train a network which is able to turn any input content image into its pastiche in that specific style. The loss formulation is unimportant for this discussion, so just know that there's a way to provide a training signal that encourages that. Here we're also faced with an explosion in parameters: if I train a specialized network for each style image and I want to build a system that handles a collection of different style images, I need to train separate networks for each style image.

B

So again, the naive approach that trains one network per style image in this system leads to an explosion in the number of parameters that you need to train, and again the feature-wise transformation solution to that is to turn the network into a style-conditional version of itself by inserting FiLM layers in the network architecture. Okay, so let's examine these three examples.

B

The first example is a paper I authored, where we were considering a setting in which we have a finite collection of style images. This is very similar to the class-conditional generative modeling example, but instead of having classes to model, we have different styles to model, and the intuition behind that really was that we wanted somehow to compress these many different models into one style-conditional model.

B

We wanted it to be more parameter-efficient, and the way in which we do the conditioning is by specializing instance normalization layers in the network to each style. A small note on that: normalization layers are a natural location to use feature-wise transformations. Just a small reminder on normalization layers: you take your input feature vector or stack of feature maps, you normalize it with respect to the whole batch, or the spatial axes, or whichever variant you want, and then you have a scaling and a shifting operation afterwards.

B

So the features are centered, they're standardized, but now you can rescale them and you can shift them. So we already have in normalization layers these sorts of feature-wise scaling and shifting, and it's a natural place to be putting in these FiLM layers.

B

In this paper we call this conditional instance normalization. You can think of this as swapping out instance normalization layer parameter values depending on the style that we're using and, yeah, that's essentially it about that method.
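A minimal sketch of the idea, assuming feature maps stored as nested lists indexed `[channel][row][column]`: each feature map is normalized over its spatial positions, and the scaling and shifting coefficients are then looked up per style. The style names and parameter values are made up for illustration; in the actual method they would be learned.

```python
import math


def instance_norm(fmap, eps=1e-5):
    """Center and standardize one feature map over its spatial positions."""
    values = [v for row in fmap for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [[(v - mean) / math.sqrt(var + eps) for v in row] for row in fmap]


# Hypothetical per-style parameters: style -> (gamma per channel, beta per channel).
STYLE_PARAMS = {
    "style_a": ([1.5, 0.5], [0.0, 1.0]),
    "style_b": ([0.2, 2.0], [-1.0, 0.0]),
}


def conditional_instance_norm(feature_maps, style):
    """Normalize each feature map, then apply the scale and shift of the given style."""
    gamma, beta = STYLE_PARAMS[style]
    out = []
    for fmap, g, b in zip(feature_maps, gamma, beta):
        normed = instance_norm(fmap)
        out.append([[g * v + b for v in row] for row in normed])
    return out


maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, -1.0], [1.0, 0.5]],
]
stylized_a = conditional_instance_norm(maps, "style_a")
stylized_b = conditional_instance_norm(maps, "style_b")
```

All other weights of the style transfer network stay shared; only the small table of normalization parameters changes from one style to the next.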

B

Okay, so we can extend that pretty naturally to the zero-shot setting. Say I have a new style image; my system wasn't trained on it, but I'd like to be able to do well on it. Well, as we've seen with the visual reasoning example, we can do so by predicting FiLM parameter values from the task description. In this case, our task description is the style image, so why not introduce a mapping from style image to FiLM parameter values? This is what was done in this paper.

B

Briefly, the way this works is that we have our input content image that we'd like to stylize, and we have our style image, which we pass through the FiLM generator.

B

The FiLM generator predicts a set of FiLM parameters to be used in the main style transfer network, and then we feed in the content image and we get the output. This is a small animation here to show you what happens if you feed in style images that have not been seen during training and, as you can see, the model does a pretty good job at zero-shot generalization in this case.

B

Another related example that you might see cited in other papers is AdaIN, for adaptive instance normalization. This is yet another name, but the principle stays the same: we specialize instance normalization parameter values to different style images, and it also allows zero-shot generalization. What I think is interesting in this work is the way in which the FiLM generator is constructed. In the previous work, we had an explicit mapping from style image to the value of the FiLM parameters.

B

Here, what's happening is that we feed both the content and the style image through some encoder that is shared between the two, and then, when we want to perform instance normalization, we normalize the content stack of feature maps and then we scale and shift it according to statistics collected from the style stack of feature maps, and that is sufficient in their case to allow zero-shot generalization. So if you squint your eyes again, the FiLM generator is a heuristic that reuses the encoder network and performs a fixed processing on the style stack of feature maps.

B

So this is just to say, there are many ways in which you could design this FiLM generator.
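The adaptive instance normalization step described above can be sketched as follows. The shared encoder is left out: the two inputs stand for the content and style stacks of feature maps it would produce, stored as nested lists indexed `[channel][row][column]`. Function names and values are illustrative assumptions.

```python
import math


def channel_stats(fmap, eps=1e-5):
    """Mean and standard deviation of one feature map over its spatial positions."""
    values = [v for row in fmap for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var + eps)


def adain(content_maps, style_maps):
    """Normalize content features, then scale and shift them with style statistics."""
    out = []
    for c_map, s_map in zip(content_maps, style_maps):
        c_mean, c_std = channel_stats(c_map)
        # The style statistics play the role of the FiLM parameters:
        # s_std is the scaling coefficient and s_mean is the shifting coefficient.
        s_mean, s_std = channel_stats(s_map)
        out.append([[s_std * (v - c_mean) / c_std + s_mean for v in row] for row in c_map])
    return out


content = [[[1.0, 2.0], [3.0, 4.0]]]
style = [[[10.0, 10.0], [14.0, 14.0]]]
stylized = adain(content, style)
```

Nothing in this step is learned per style, which is why a style image never seen during training can be handled the same way as any other.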

B

Okay, this problem setting should be familiar to you: we're talking about class-conditional generative modeling. Has Emily discussed BigGAN yesterday? Maybe, maybe not? Yes? Good, so I won't spend too much time talking about it again. This is a pretty impressive model that came out recently, and it uses conditional batch normalization for class conditioning, which you'll recognize as an application of feature-wise transformations.

B

It also incorporates a twist on how the generator uses the input noise. A low-dimensional embedding is learned for each class, and then the way in which we predict the FiLM parameters is that we first take the noise vector as input and we split it into different chunks, which are shown here. Then we concatenate a noise chunk to the class embedding, and we linearly project that into FiLM parameter values.

B

One way to see this is that noise is being injected at multiple locations in the generation process. Another way to see this is that we have noisy FiLM parameter values, and that, as you can see on the right, allows the model to generate pretty convincing images. Of course, this is not the only ingredient in there that makes it work: they scaled the architecture up by quite a lot and scaled up the batch size as well, but feature-wise transformations are at the core of the class conditioning mechanism in this work.
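The chunking scheme described above might be sketched like this. All sizes, names, and the random weights are assumptions for illustration (a real model would learn the embeddings and projections), and the sketch covers only the prediction of FiLM parameters, not the generator itself.

```python
import random

rng = random.Random(0)
NUM_CLASSES, EMBED_DIM = 3, 2
NOISE_DIM, NUM_LAYERS, NUM_FEATURES = 6, 3, 4
CHUNK = NOISE_DIM // NUM_LAYERS  # one noise chunk per FiLM layer


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Random stand-ins for learned parameters.
class_embedding = rand_matrix(NUM_CLASSES, EMBED_DIM)
projections = [rand_matrix(2 * NUM_FEATURES, CHUNK + EMBED_DIM) for _ in range(NUM_LAYERS)]


def film_parameters(noise, class_id):
    """Predict one (gamma, beta) pair per layer from a noise chunk and the class embedding."""
    params = []
    for layer in range(NUM_LAYERS):
        chunk = noise[layer * CHUNK:(layer + 1) * CHUNK]
        conditioning = chunk + class_embedding[class_id]  # list concatenation
        projected = matvec(projections[layer], conditioning)
        params.append((projected[:NUM_FEATURES], projected[NUM_FEATURES:]))
    return params


noise = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]
params = film_parameters(noise, class_id=0)
```

Each layer sees its own part of the noise, so perturbing one chunk changes the FiLM parameters of that layer only.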

B

Okay, let's travel back to the ancient times of 2015-2016, which, in deep learning years, is really, really old, apparently. DCGAN is perhaps the most cited GAN paper after the original GAN paper, because it introduced a convolutional architecture that worked much, much better than the fully connected architecture in the original GAN paper, and they condition on the class by concatenating the class label to stacks of feature maps in the network. So it's a slightly different mechanism.

B

They do conditioning by concatenation rather than by biasing, but the reason why I point out this paper to you is because there's an equivalence between concatenation-based conditioning and conditional biasing. To see why, think of this example here. We have our input here and we have the conditioning signal here. We concatenate them together.

B

We multiply that with a matrix. You can always decompose the matrix-vector product into two smaller products, one with just the input and one with just the conditioning signal, and afterwards you add these two resulting vectors element-wise. What you can recognize here is that you can think of the second one as a conditional bias. So really, concatenation and biasing are essentially the same from that point of view.
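The decomposition can be checked numerically with a small example. The matrix, input, and conditioning values below are arbitrary and chosen only for illustration.

```python
def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Weight matrix applied to the concatenation [x ; z].
W = [
    [1.0, 2.0, 0.5, -1.0, 3.0],
    [0.0, -1.0, 2.0, 4.0, 0.5],
]
x = [1.0, 2.0, 3.0]   # input features
z = [0.0, 1.0]        # conditioning signal, e.g. a one-hot class label

# Concatenation-based conditioning: one product with the concatenated vector.
concat_out = matvec(W, x + z)

# Same computation, split into an input part and a conditioning part.
W_x = [row[:len(x)] for row in W]
W_z = [row[len(x):] for row in W]
conditional_bias = matvec(W_z, z)  # depends only on the conditioning signal
split_out = [a + b for a, b in zip(matvec(W_x, x), conditional_bias)]

# Both routes give [9.5, 4.5].
```

The second product never touches the input, so it acts exactly like a bias vector that is selected, or predicted, by the conditioning signal.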

B

While we're on the topic of GANs, here's another recent architecture which made quite a splash. It's called StyleGAN, and those are actual model samples, not real pictures of faces. So it did very well on that task, and one of the central principles here is that the input noise vector isn't even fed as input to the generator.

B

So instead we learn a fixed constant to serve as the input of the generator, and the way in which the noise vector intervenes is purely by predicting the values of the FiLM parameters to be used in these adaptive instance normalization layers, which we've seen earlier.

B

So that is another twist on how to use feature-wise transformations, and they even removed the input from the generator in that case. Okay, let's switch gears a little. Here's another cool idea. Say that you want to deploy a model on devices that have different amounts of processing power. You could train separate networks with different widths, for example, to have different kinds of computational expense,

B

if you will, one for each desired level of computational effort. But as we discussed in the context of class-conditional generative modeling, this is wasteful, and so instead it would be better to train a single network and use only the neurons up to a certain width in the network. That way you deploy only one model, and then you select the width that corresponds to the computational effort you want to make.

B

It turns out that by specializing batch normalization layers to each target width, which they call switchable batch normalization in the paper, you can do just that. There are batch normalization layers in this model architecture, and we learn a different set of batch normalization parameters for each width. It turns out that you can get good trade-offs between accuracy and computational effort, and you can select the width at deployment time.
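
As a rough illustration of the idea, here is a pure-Python sketch of a batch normalization layer that keeps one private set of scale and shift parameters per supported width. The class name, the list-of-lists data layout and the use of batch statistics only (no running averages) are simplifying assumptions, not the paper's implementation.

```python
import math


class SwitchableBatchNorm:
    """Toy batch norm with a separate (gamma, beta) set for each width."""

    def __init__(self, widths):
        self.params = {
            w: {"gamma": [1.0] * w, "beta": [0.0] * w} for w in widths
        }

    def __call__(self, batch, width, eps=1e-5):
        """`batch` is a list of feature vectors; only the first `width`
        features of each vector are used."""
        p = self.params[width]  # parameters private to this width
        n = len(batch)
        out = [[0.0] * width for _ in range(n)]
        for j in range(width):
            col = [row[j] for row in batch]
            mean = sum(col) / n
            var = sum((v - mean) ** 2 for v in col) / n
            for i in range(n):
                normalized = (col[i] - mean) / math.sqrt(var + eps)
                out[i][j] = p["gamma"][j] * normalized + p["beta"][j]
        return out


sbn = SwitchableBatchNorm(widths=[2, 4])
batch = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]
narrow = sbn(batch, width=2)   # cheap setting: first two features only
full = sbn(batch, width=4)     # full-width setting
```

The rest of the network's weights would be shared across widths; only these normalization parameters are specialized, which is what makes this a feature-wise conditioning mechanism.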

B

So I think that example illustrates really well the concept of compressing many models into one using feature-wise transformations.

B

Okay, so far we've only covered image synthesis applications, but feature-wise transformations are also used in other research areas. Here's a reinforcement learning example, and the goal here is to do instruction following. We have an agent here which is navigating in a Doom-like environment, we have an instruction expressed in natural language, and we expect the agent to carry out that instruction. In this example, the way in which the policy network is conditioned is that we predict feature-wise gating parameters from the instruction.

B

So we have the instruction as input, we process it, and then we predict gating parameters to be applied to the output of the visual pipeline part of the policy, and in that paper this is sufficient to be able to do instruction following. Here we have a language model where one half of the feature vector is used to predict how to gate the other half. So in this diagram

B

it sits right here. We predict feature vectors at every time step, then we take one half, this half here, pass it through a sigmoidal nonlinearity, and multiply it element-wise with the other half.
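
The gating step just described fits in a few lines. This is a minimal sketch on a single feature vector; which half is treated as the gate is an assumption made here for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_linear_unit(features):
    """Split the vector in two, squash the second half with a sigmoid,
    and use it to gate the first half element-wise."""
    half = len(features) // 2
    values, gates = features[:half], features[half:]
    return [v * sigmoid(g) for v, g in zip(values, gates)]


# A gate pre-activation of 0 lets half of the value through.
gated = gated_linear_unit([2.0, -4.0, 0.0, 0.0])
```

Because the gates are computed from the same feature vector they modulate, this is a case of the network conditioning itself rather than being conditioned by an outside signal.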

B

So the motivation in this paper has more to do with avoiding vanishing gradients than it has to do with using feature-wise transformations, but I think it's illustrative of another interesting concept here, which is that of self-modulation. In the examples I gave previously, the conditioning signal was always external.

B

It was a side information channel, but there's nothing that says that this conditioning signal has to come from an auxiliary information source. It could very well be coming from another part of the network, as in this example, and we'll see other examples of that. So here's another example.

B

So this is the submission that won first place in the ImageNet 2017 challenge, and one of the central ideas in the model architecture is that they have a pathway branching off from the main pathway which predicts feature-wise scaling parameters to be applied to the main pathway. That architectural feature, along with the other improvements suggested in the paper, helped win the ImageNet 2017 competition.

B

Okay, so somewhat related to self-modulation is the idea of adapting to changes in the input distribution. Here's an example illustrating that. We have a speech recognition model and we want it to be robust to different speakers and to different noise conditions. The way in which this model does it is that it first builds a representation of the full utterance, and then it uses that to predict the layer normalization parameter values to be used inside of the layer normalization layers.

B

They call that dynamic layer normalization in this paper, but again it's an instance of feature-wise transformation. Here's another example of that. This notion of adapting to different input distributions using feature-wise transformations, we also find it in few-shot image classification. So, very briefly, what is few-shot image classification? Few-shot learning aims to learn from

B

very few labeled examples. We seek to mimic the human ability to learn from just a few examples, and to simulate that in few-shot learning we build what are called learning episodes, where we have a learner that sees a new small training set for a new learning problem and is expected to perform well on a held-out set of examples.

B

And so the actual training loss is not how well the learner is doing on the training set.

B

It's how well it's doing at generalizing on a test set. There are many, many learning episodes like this, which are formed artificially in order to encourage that, and we optimize the whole process such that when you train on the small training set, you generalize well on the held-out set. One of the problems that you can have with this is that the learner may see a completely new input distribution, for instance images that are unlike what it has seen before, with different statistics. In this paper they proposed the idea of first predicting FiLM parameter values from the training set and then using those FiLM parameter values inside the main learner in order to be more robust to changes in input distribution.

B

Okay, so hopefully these examples give you a good taste of the variety of settings in which feature-wise transformations are applied. Now I'd like to focus on interesting properties of feature-wise transformations themselves, and I'll start with a disclaimer. Some of the interpretations that I'll be making in this part of the lecture are more speculative than factual by nature. So please do exercise critical thinking when assessing what I'm saying here.

B

Okay, one interesting consequence of FiLM is that we end up predicting a low-dimensional vector of FiLM parameters. To give you an example, in the style transfer model I was working on, the FiLM parameters accounted for about 0.2 percent of the total parameter count in the whole model. So that's not a lot of parameters to be modulating to have such a drastic effect on the computational properties of the model. Squinting our eyes again, there are at least two possible interpretations of this.

B

One, from the computational perspective, is that FiLM parameters are an instruction on how to modulate computation in the task-solving network. Another interpretation, from the representation learning perspective, is that FiLM parameters are a representation of the task description. So let's focus on the representation learning perspective for an instant. Assume we can extract a representation from our task descriptions. What can we do with that representation?

B

If you recall the style transfer example, the common denominator was that we associate FiLM parameters with individual style images. So if you think of the style image as a stylization task, then the FiLM parameters are a bit like its style representation.

B

Now, my background is in generative modeling, and we have a cliché, which is that we love to do interpolations in latent space. We love to interpolate vectors in latent space and see things vary smoothly in pixel space, and this is a bit like when the neural networks community was fixated on showing weight filters for a while. So we have that cliché, I guess, and of course, when I was working on style transfer, I had to interpolate FiLM parameters.

B

So here's what's going on in this animation. Each individual image on the left corresponds to a different style that the model was trained on, and to the right here we have the video feed of a camera that is being fed through the style transfer network frame by frame. We're using FiLM parameters, and I'll describe how we get them in a moment, to stylize it. And so how do we get these FiLM parameters?

B

Well, we can take a convex combination of different FiLM parameters associated with different style images, and what you see when we're doing that is that we're varying the style smoothly, so we're transitioning smoothly from one style to another.
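
A convex combination of FiLM parameters is just a weighted average with non-negative weights that sum to one. The sketch below uses two made-up styles with two features each; the numbers are illustrative, not values from a trained model.

```python
def blend_film_params(styles, weights):
    """Convex combination of per-style (gamma, beta) pairs."""
    assert all(w >= 0.0 for w in weights)
    assert abs(sum(weights) - 1.0) < 1e-9
    num_features = len(styles[0][0])
    gamma = [sum(w * s[0][j] for w, s in zip(weights, styles))
             for j in range(num_features)]
    beta = [sum(w * s[1][j] for w, s in zip(weights, styles))
            for j in range(num_features)]
    return gamma, beta


def film(features, gamma, beta):
    """Feature-wise affine transformation: scale, then shift."""
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


style_a = ([1.0, 2.0], [0.0, 1.0])    # (gamma, beta) for a made-up style A
style_b = ([3.0, 0.0], [2.0, -1.0])   # (gamma, beta) for a made-up style B

gamma, beta = blend_film_params([style_a, style_b], [0.5, 0.5])
blended = film([1.0, 1.0], gamma, beta)
```

Sweeping the weights from `[1.0, 0.0]` to `[0.0, 1.0]` moves the parameters along a straight line between the two styles, which is the kind of transition shown in the animation.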

B

There's a pretty clear use case in this example: it's a more natural way of interacting with the model. It allows users to blend styles together to express their creativity. But I think that at a more abstract level, what we can conclude from that is that interpolating between task representations leads to meaningful changes to the task-solving network's computational properties. This is not a property that should hold in all use cases, but in this instance it does, and it also does in this example.

B

So this is BigGAN again, and as a reminder, this is a class-conditional generative model where we're conditioning via FiLM layers inserted throughout the architecture, and the FiLM parameters are predicted linearly from the concatenation of the input noise and a learned class embedding. A consequence of doing things that way is that we now have a class representation for each class, and we can think of interpolating between these class representations.

B

So we can probe what's in between categories by taking linear interpolations of class embeddings, and once again you see smooth transitions. It kind of reminds me of the old Animorphs TV show, for those of you who know what I'm talking about. What's interesting here as well is that the pose remains roughly the same in those examples, which maybe suggests that there's a nice separation between what's encoded by the input noise and what's encoded by the class embeddings.

B

So this ability to interpolate between tasks aligns nicely with the broader theme of what meaning to assign to hidden layers in a neural network. As you've probably been told, one interpretation is that a hidden layer is an abstract representation of the input, but an alternative interpretation, which I first heard from Ian Goodfellow but which has almost certainly been put forward by others, is that hidden layers amount to intermediate stages in a numerical program.

B

So what this suggests to me is that feature-wise transformations may be a mechanism through which a model learns and composes computational primitives, and this interpretation is closer to the computational perspective than the representation learning perspective. But as you've noticed by now, moving freely between interpretations is more or less a theme of its own in this lecture, and I think this is a good thing.

B

Okay, beyond interpolation, another operation that the deep learning community loves to apply to representations is analogies. Here we have the classical word2vec example. In word2vec we learn an embedding for words in a vocabulary, and then we perform analogies. The classical example is king minus man plus woman equals queen. So if we take the word embedding for the word king, subtract the one for the word man, and add the one for the word woman, lo and behold, we get the embedding for the word queen. Terms and conditions apply.

B

You have to exclude the query words here for that to work, but you get the idea. In our visual reasoning work we were curious to see if that property also held for the learned task representations. In this example, the question is "What is the blue big cylinder made of?", and the usual way of operating the model is that we feed the question through the FiLM generator, we use the output FiLM parameters in the task-solving network, and we feed in an input image.

B

We get a predicted answer as output, in this case rubber, which is not the right answer. It should be metal, and perhaps this is because the question involves concepts that have been seen in training in isolation, but never together before, so the model doesn't really know what to do with that new question. In any case, what happens here is that the model fails to generalize to a new question.

B

So taking inspiration from the word2vec example, an alternative way of operating the model is to express this question as a combination of three questions that have been seen during training. The question "What is the blue big cylinder made of?" can be thought of as "What is the blue big sphere made of?" plus "What is the green big cylinder made of?" minus "What is the green big sphere made of?"

B

On one hand, you have blue plus green minus green, so blue, and on the other hand, you have sphere plus cylinder minus sphere, so you have cylinder. So we symbolically recover the question.

B

So first we feed all three of these questions through the FiLM generator. Then we combine their FiLM parameters in the same way, and it turns out that this works for the visual reasoning model we were studying. Not always, but it still corrected a non-negligible number of errors that the model was making. So I think this is an interesting observation on its own, but it also raises interesting questions about failure modes for this type of architecture.
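
The arithmetic on FiLM parameters can be sketched with a toy stand-in for the FiLM generator. Everything below is hypothetical: the word vectors are made up, and the toy generator simply sums them, so it is linear and the analogy holds exactly. The real generator is a learned recurrent network, for which, as noted above, the trick only works some of the time.

```python
# Made-up word vectors standing in for a learned question encoder.
WORD_VECTORS = {
    "blue": [1.0, 0.0, 0.0],
    "green": [0.0, 1.0, 0.0],
    "big": [0.5, 0.5, 0.5],
    "sphere": [0.0, 0.0, 1.0],
    "cylinder": [1.0, 1.0, 0.0],
}


def toy_film_generator(question):
    """Hypothetical FiLM generator: sums the vectors of the question words."""
    total = [0.0, 0.0, 0.0]
    for word in question.split():
        total = [t + v for t, v in zip(total, WORD_VECTORS[word])]
    return total


def analogy(a, b, c):
    """Combine FiLM parameters as a + b - c, element-wise."""
    return [x + y - z for x, y, z in zip(a, b, c)]


# "blue big sphere" + "green big cylinder" - "green big sphere"
combined = analogy(
    toy_film_generator("blue big sphere"),
    toy_film_generator("green big cylinder"),
    toy_film_generator("green big sphere"),
)
direct = toy_film_generator("blue big cylinder")
```

In this toy setting `combined` and `direct` coincide. In the trained model the two parameter vectors differ, and the interesting observation is that the combined one sometimes yields the correct answer when the direct one does not.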

B

The same goes more generally for the FiLM framework. Knowing that we can recover the right answer through algebraic manipulations of FiLM parameters, when a model makes a mistake, is it because it lacks the computational primitives to solve the task? Or is it because it failed to combine them appropriately?

B

I think there's an important lesson here too, which is that, just like feedforward networks can overfit on the training set, task-conditional networks can overfit on the training set of tasks. Overfitting was in fact one of the major challenges in this work, and reducing overfitting required being very careful in choosing the capacity of the FiLM generator.

B

So if you're ever working with these types of architectures, know that there are many sources of overfitting, and know that you can target different parts of the architecture to combat that. Another lesson to be learned... oh, yes? So, it's a good question. I think if you're primed on the problem, you could probably answer that question correctly. Here the rubber objects are never shiny, so they're matte. This is an artificially constructed

B

setting, I'd say. This specific dataset, CLEVR, was built because it was observed that, in visual reasoning, the models tended to rely on biases in the dataset rather than on actual reasoning capabilities. So if you see a green pasture and I ask you what the animal is, you'd probably think it's a cow, and this is exactly what the models were doing. They were not answering based on the content of the image, but just based on statistical regularities.

B

So the goal for that specific dataset was to reduce these sorts of biases that the model could rely on. Now, is it a fair task to be achieved by humans? I think humans were scoring pretty well when they were evaluated on that. I'd have to look again. Maybe it's that specific example that is confusing, but generally we're pretty good on that dataset as well.

B

Okay, cool. Thank you for your question.

B

So, getting back into context: why are we making mistakes, and what lessons can this teach us? Another lesson is that the FiLM point of view, I think, highlights a separation between the various computational primitives that are learned by the FiLM-ed network, if you will, and the numerical recipes that we learn through the FiLM generator, which describe how to combine those computational primitives.

B

So the model's ability to generalize depends both on its ability to parse new forms of task descriptions and on having learned the computational primitives required to solve those tasks.

B

Okay, I'd like to conclude this lecture by discussing notions of interpretability. I'm by no means an expert on the topic, and luckily you'll get to hear from an actual expert on the topic right after the break. But I couldn't pass on the opportunity to discuss what sort of interpretations we can make in the context of feature-wise transformations. When I was playing with conditional instance normalization in the style transfer work,

B

one of the things that surprised me very much is that, despite their simplicity, feature-wise transformations are very effective at modulating the computation in a network. Again, it's 0.2 percent of the total parameter count that we're varying, and that is sufficient to cause drastic changes in the computational properties of the style transfer network. This is the hammer I've been hitting every nail with ever since, and it's been pretty effective in many different problem settings.

B

So to me this raises the question: how can so few simple interactions compound into meaningful modulations of the task-solving network? This is something which I've always wanted to explain.

B

So let's speculate a little. Given what we've been discussing so far regarding learning and composing computational primitives, one hypothesis we could make is that feature-wise transformations are a sort of selection mechanism for those primitives.

B

So, for instance, we could think of feature-wise transformations as a mechanism for shutting off features or feature maps. Why would that be a selection mechanism? Well, if you take a feature and you zero it out, it's as if you had never computed it in the first place. You could do it by setting the scaling to zero and the biasing to zero as well. You could also do it by using a very negative bias.

B

If your operation is followed by some nonlinearity which zeros out things that are negative, like a rectified linear unit, for instance, that could be used. So maybe there is a mechanism like that at play here, and it sounds like a wonderful interpretation, but in practice it doesn't hold.
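
The two ways of switching a feature off can be written out directly. This is a small sketch with made-up feature values; it shows the hypothesized mechanisms only, which, as just stated, did not turn out to explain the trained models.

```python
def film(features, gamma, beta):
    """Feature-wise affine transformation: scale, then shift."""
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


def relu(features):
    return [max(0.0, x) for x in features]


features = [0.7, -1.2, 3.0]

# Mechanism 1: zero scale and zero shift remove feature 1 outright.
zeroed = film(features, gamma=[1.0, 0.0, 1.0], beta=[0.0, 0.0, 0.0])

# Mechanism 2: a very negative shift pushes feature 2 below zero,
# and the ReLU that follows clamps it to zero.
biased = relu(film(features, gamma=[1.0, 1.0, 1.0], beta=[0.0, 0.0, -100.0]))
```

In the second case feature 1 also ends up at zero, but only because its value happened to be negative for this input; feature 2 is off regardless of any reasonably sized input.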

B

Unfortunately, I don't have fancy plots to show for that, but here's what I can tell you. For the Distill article, we looked into the style transfer and visual reasoning models, that is, the one for which we were predicting FiLM parameter values from the style image, and the visual reasoning model I presented as the first example in the examples section. Both of them use affine variants of FiLM, and we looked for evidence in favor of this selection mechanism hypothesis.

B

So one of the things we would expect to see if we're selecting features left and right is that applying FiLM should lead to a non-negligible number of feature maps being turned off. However, when we measured the sparsity, we found it to be very low. I think it's at most 10% of the feature maps that were completely off at any given time, and in instances where the feature maps were turned off,

B

we weren't able to identify a unique mechanism which would explain why they were turned off, so things were a bit all over the place. Sadly, the conclusion we drew from this is that there's no supporting evidence for the selection mechanism hypothesis, and there's apparently no unique mechanism by which FiLM can turn off features or feature maps. I wouldn't completely rule out that a modified version of the hypothesis could offer an explanation, and also keep in mind that the number of models we examined is extremely small.

B

We're talking about two models which use the affine variant, and so there are tons of things to explore here, but to this day I don't think we have a good explanation for how feature-wise transformations are able to compound into meaningful modulations of the task-solving network. The consolation is that there are some interesting things we can say about the way in which FiLM parameters cluster across task descriptions, and even though this has not yet yielded a lot of insight into how FiLM operates at a computational level,

B

I think this has the potential to help interpret the purpose of different parts of the task-solving network. So let's take a closer look at that. What you see here are scatter plots of FiLM parameters. This is for the visual reasoning model, and this is for the style transfer model. The x-axis corresponds to the scaling coefficient, the y-axis corresponds to the shifting coefficient, and each point corresponds to a different task description, so here a different question and here a different style image, and also to a different feature map.

B

To help visualize things, we color-coded the points by feature map, so all points of a certain color correspond to different questions and their associated FiLM parameter values for a specific feature map. Okay.

B

So marginally, across all feature maps, this is a big mess; there's not a lot of structure. However, if we look at individual feature maps, so individual colors here, we see that there's a fairly simple structure. For most feature maps in both models, the points tend to cluster into a single dense blob, and for the visual reasoning model they occasionally form more than one. Wait until we find one... this one, this one. So there are a few examples like that.

B

You can see. I'll expand more on that later. Thinking back to style interpolations, maybe this explains why they don't produce garbage results: any convex combination of FiLM parameters is likely to correspond to a meaningful parameterization of the task-solving network, so the network kind of knows what to do with it. So that's reassuring.

B

Also, looking at the axis alignment of those different blobs, we see that some clusters tend to vary along the scaling axis, like in this example, some others tend to vary along the biasing axis, and some tend to be diagonally aligned. My interpretation of this is that the model has learned to modulate feature maps in various ways, sometimes by scaling, sometimes by biasing, and sometimes using a combination of both.

B

So perhaps looking for a unique mechanism that explains how FiLM modulates was a bit pointless. There are many mechanisms. Also keep in mind that this is for affine transformations. I'm sure there are other interpretations to make in the case of gating, for instance, or other types of mechanisms.

B

Okay, so those FiLM parameter scatter plots hint at the fact that the way in which FiLM parameters are structured across tasks varies depending on the problem setting. This is what I was hinting at earlier when I said that for the visual reasoning example we see more than one mode, which we're not observing in the case of the style transfer example.

B

So this means that perhaps we shouldn't expect to find a unique and problem-independent explanation for the success of feature-wise transformations, but still there are some interesting things we can say when we look at these individual problems. I am showing you two cherry-picked examples of these multimodal blobs that we found in the visual reasoning model. In the Distill article we hypothesize that maybe it's the compositional and discrete nature of visual reasoning that requires these different, well-defined modes of operation, which are not necessarily as much required in the style transfer example.

B

So moving forward, we can try to infer how questions group together in different feature maps. Here we color-coded the scatter plot by question type. This is one feature map, and this is another feature map. This question type information is metadata that is available in the CLEVR dataset that was used to train this model, but the model wasn't fed this information. This is just post hoc analysis, and what we see here is that sometimes there's a clear pattern that emerges, like in this right plot here.

B

So if you look at color-related question types, they tend to be found in the top right cluster. So let's see: equal color, and then let's look back, so

B

query color as well. So sometimes there's a pattern like that. What's behind it? Good question, but it gives us some clues. Sometimes it's harder to draw a conclusion, as with the left plot, where things seem to be just all over the place with no clear structure. In that case we can turn to the question content itself. So what we did here, this is the same plot as this plot here, and then we just switched over to a new feature map for illustration purposes on the right here.

B

Okay, so what's happening here is that we're just putting more focus on questions that contain these words, so metallic, shiny, and so on.

B

Okay, so let's look at the left plot, especially when we hover over the words front and behind. In the left plot, things tend to distribute fairly uniformly across clusters, except for front and behind, where we seem to have a bias towards the left or the right cluster. I can't tell exactly what's going on in that feature map, but it does look like something different happens when we're conditioning on questions relating to relative depth in the scene. So is it possible that the computation it carries out somehow pertains to that notion of relative depth?

B

I think that could be an interesting thing to investigate. Now if we look at the right plot, when I hover over material-related words, so material, rubber, matte, and then metal, metallic and shiny, you see that there's again some kind of left/right bias, depending on whether we're talking about matte and rubber-like material or shiny and metallic material.

B

By the way, these plots can all be interacted with in the Distill article in case you're curious, and you can also hover over points in the Distill article to see what the questions are. The examples I showed are cherry-picked for clarity, but in truth we didn't search for them very hard before we found them. These sorts of patterns tend to happen all over the place.

B

So again, looking at scatter plots doesn't even come close to fully explaining what's going on, but I think it's raising interesting questions that could be followed up on. Taking a step back, given that both the style transfer and the visual reasoning examples seem to exhibit a lot of structure in the way in which FiLM parameters cluster, we can think of doing 2D projections to try and see how questions cluster on the 2D plane.

B

So for the Distill article we used a technique called t-SNE. I won't go into too many details about t-SNE itself; just know that it's a projection technique that tends to maintain structure between points, and if you're curious to know more about t-SNE itself, you can have a look at the Distill article on t-SNE. I think this is a great resource on how to use it well. So here's the t-SNE projection of FiLM parameters for the visual reasoning model.

B

Once again, each point corresponds to a different question, and we color-coded by question type. Overall, questions tend to cluster fairly well by question type. Sometimes you have some isolated clusters that look a bit weird. If you look at "exists" right here, it clusters with "equal color", and this doesn't appear to make that much sense at first glance. But then, if you look at what the questions in here are, these are questions about the existence of objects that have the same color.

B

It's a combination of both of these concepts, so it's kind of reassuring to see that things tend to make sense.

B

If we're looking at the t-SNE projections for our style transfer network example, again each point corresponds to a different style image, which we color-coded by painter, and things don't cluster as nicely. But maybe this is to be expected; again, style transfer as a problem is perhaps not as compositional as visual reasoning. I have a fun anecdote to share about this. When we were running the t-SNE projections, we did many, many runs, and one of the clusters that was especially robust across different runs

B

was this one here, which, if you look at the painters that are grouped together, doesn't seem to be making a lot of sense. I hope I'm not butchering his name, but Shishkin and Rembrandt are clustered together, and if you look at their paintings, they don't look very much the same. My knowledge of art

B

history is pretty embarrassing, but I do know how to Google, so I'm proud to tell you that I learned that day that Rembrandt is well known for his portraits, among other things, and that Shishkin was especially good at forest landscapes, so not exactly the same style. So we were scratching our heads, and then

B

we thought our projection was broken somehow. But then, when we looked at the actual images that were clustered there, they're all sketches, and it turns out that both made sketches that are found in the dataset that was used to train this model.

B

I'll be honest: we're not winning a Turing Award anytime soon, with this observation, but I think it's a fun anecdote to share nevertheless, kind of shows the sort of surprises or insights that you can gain by looking at these film parameter values and how they cluster across different axes.

B

Okay, so I was pointing out all of these examples to give you a flavor of the sort of analyses one can run on the FiLM parameters of trained task-conditional models. In this lecture I voluntarily refrained from speculating on scientific applications because, frankly, I think you're in a better position to judge that than I am. What I do hope is that by showing you different learning perspectives and several application examples, you'll be better equipped to draw connections to your own research problems.

B

As a closing remark and a good segue into the next lecture, I'll highlight again the topic of interpretability. For some applications, it's sufficient to treat deep learning as an engineering discipline, and I don't think I have to tell you that for scientific applications

B

it is not. It's important to treat the models as more than just black boxes that map from an input to an output. In that regard, feature-wise transformations, I think, operate at a level of granularity that is well suited for performing analyses because, as the name implies, they act at the feature level.

B

So, as we saw when applying feature-wise transformations in the task conditioning setting, the post hoc analysis we did can give us a sense of how tasks cluster in their modulating effect on individual features, and we can then use this information to formulate hypotheses about the purpose of those features.

B

Of course, this is still very manual labor, but I think there is great value in pushing those ideas forward and automating that process somehow. I think that could be very beneficial for interpreting the models, and personally I feel like I've just scratched the surface when it comes to interpreting feature-wise transformations. An area where I think there is great potential, if I had to make a prediction, is performing interventions on the model after it is trained and asking counterfactual questions about the model.

B

So, to give you a really small example of that, think back to visual reasoning and to the sorts of questions we could ask the model. We could ask a question like "How many blue cubes are there?", and then we could artificially intervene on that question,

B

changing maybe just one word, for instance asking about red objects rather than blue objects, and then looking at the differences in the statistical properties of the activations at different places in the network. I think these sorts of counterfactual interventions could perhaps give us more insight into how the model is operating.

B

Okay, I think this is a good place to stop. So thanks a lot for your attention, and once again, if you're curious to know more about feature-wise transformations, you can look up the Distill article. I also included a list of references at the end of these slides for the papers I mentioned in the lecture. I think we still have perhaps 10 minutes for questions, and like I said, I'll be there at the break if you have other questions. So I'm ready to take your questions, if you have them.