National Energy Research Scientific Computing Center (NERSC) Deep Learning for Science School 2019, 7 Aug 2019

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: 06 - Introduction to Neural Networks II - Mustafa Mustafa

Description

Deep Learning for Science School 2019 - Lawrence Berkeley National Lab
Agenda and talk slides are available at: https://dl4sci-school.lbl.gov/agenda

A

Okay, so yesterday we talked about how to build neural networks. We saw examples of that and we so also the echo system or frameworks like tensor go with the API of Karass. We did talk generally about you know, neural networks. Are these complex functions that we try to use to approximate relationships that we have in the data, and then we talked about how we we find the weights of those functions or we how we optimize them, and we also mentioned how to monitor the learning process. We mentioned the difference between optimization and learning.

A

Our aim is really to generalize the performance beyond the training data set that we're looking at today, you will focus mostly on how we actually do that process in practice. How do we improve the generalizability of our network? So we'll talk about a few hairy issues that come in in practice. I think if you remember Brenda yesterday said that one thinks that we spend most of our time trying to find the best architecture for the problem that we're working on. But that is really not not what we yeah.

A

We spend most of our time actually trying to make these networks work. Even after you find that architecture or you find some architecture that makes sense, you need to get it to work, and this is not simple so before I get into the details. I want to remind you again that none of you know events like this are sufficient to actually get a grasp on all the nuts and bolts of actually doing deep learning in practice.

A

I strongly recommend, looking at an undergraduate level course such as yes, 231 and has great lectures and I'll be using a lot of these lectures. There are books on the market that also are very helpful. There are academic, looks like they're discerning by a and Goodfellow, and this I'm sure you have seen this this book before it's more of an academic nature, and then there are practical.

A

That josh has also talked about yesterday. Okay, so why? Why does it matter that we actually get a grasp on how to train your networks? Because we do think that the future will the future of software engineering will feature a lot of deep learning and neural networks. We have already started seeing it terms like software 2.0, which is essentially building software, that is powered by neural networks. We talked about deep learning based systems where you have your entire pipeline of software or entire stack composed of multiple parts and those multiple parts.

A

Many of them are neural networks, so we do expect that in the future we want to actually be able to to have a development cycle, that is, that a disciplined development cycle for for this software, this type of this type of software. So what characterizes software 2.0? It's essentially the fact that we use it. Okay, let's step back and think of what software 1.0 is. We always talk about this rule-based sort of software right.

A

Your ball learn the f else for loops and recursion, and all that sort of algorithms that we use to build traditional software so and the way that we achieved tasks were using traditional software is, we think about the actual problem that we're looking we're looking at the word time to solve, and then we come up with an algorithm. We can even write a flowchart for that algorithm and that flowchart is rule based right.

A

There's, a deep learning provides a different way of building software, and that way is let's collect some data where, in that data it exhibits the the sort of relationship or the task that we want to achieve, and then we use gradient descent to find the best software or the best version of that model that can achieve the tasks that we're looking at.

A

So the optimization space of software like this, this would be software 2.0 just come up with an algorithm at one point: software 2.0, you start with at some point and then use gradient descent to do an optimization process to get to the software that you want. We can actually achieve much higher complexity.

A

Software you've already seen a lot of tasks yesterday that were displayed. There are extremely difficult to actually come up with algorithms to do to do those tasks or perform those tasks with a rule-based sort of software and as Andrew hypothesis gradient descent can really write code better than you do. Okay. So how do we do this in practice? You've seen xkcd like this goes like this. Is your machine learning system yup? You pour the data into this big pile of linear algebra and collect the answers on the other side.

A

What if they have, the answers are wrong. Just turn the pile until they start making sense. Okay, so it's of course this is, you know kindly facetious, but in practice we know we need to do a lot of staring, but this Turing is not it's not easy. You will see a lot of these issues today. This is where you're going to talk about I'll talk about data normalization briefly.

A

This is an important topic that sometimes it's necessary, we'll talk about learning, great DK and then we'll spend some time talking about regularization after that move on to talking about debt and then I'll try to finish with a couple of things that are, we use in practice, things like transfer learning and some practical tips, I hope I can get through all the slides, so normalization I think this has been mentioned yesterday. A couple of times and you've also seen it in in the practical sessions that we usually normalize.

A

The data set the features of the data set, and their idea is that the different features that you have in your data set could have artificial scales that you don't want your neural network or your model to spend time trying to figure out what scale this feature is, especially if the scale is irrelevant, actually yeah I'm.

A

Most importantly, so, if you're looking at data like this one way of doing normalization is to move your distribution of your data, the mean of that distribution to zero, zero and also maybe divide by the standard, deviation to standardize or to normalize the different dimensions. It's important to remember that you don't really need to do this all the time.

A

You only need to do it if you have reasons to believe that the different dimensions, the scales of the different dimensions, do not are not important to your to your algorithm or if these two dimensions, for example, are equally important. You think that these features are important. Otherwise you don't need to do to do this. You have to think about your problem a different way of doing normal.

A

There are a lot of ways to do normalization and three depends on what you're looking at and so, for example, another way of doing it is what we call whitening. The data set, which is essentially find the eigenvalues of the eigenvectors of your distribution.

A

Do a transformation to a diagonal matrix, put your data on century on the eigenvectors and then normalize, and then you'll get a white and data. There are a lot of a lot of normalization methods that you need to look at, but you really need to think about the data set that you're dealing with okay, I'm gonna, move on to a slightly different topic. A weight initialization I didn't mention this yesterday kind of thought that it's blessed that when we start our neural network, we start with random weights.

A

The weights of our neural network are completely random, but how do we choose that initialization? First of all, if we don't initialize them at all, we know that there is no learning right like if all the weights are zero. You can think about it a little bit and look at the gradient descent to realize that there will be no updates whatsoever if I initialize, all of them to with a constant value. All of it. There is no symmetry.

A

All of them will be learning exceeding exactly the same gradient update and you are not gonna get anywhere right. So we need to break that symmetry some distribution. The first thought is to use a normal distribution right, like just another initialize and with some constant standard deviation and just use their normal distribution. This is it's okay, it works for for narrow or for shallow neural networks. However, it turns out that if you use, if you use that in deeper neural networks, you turn to the activations tend to to go to two zero mean I.

A

Think it's also easy. If you just look at you know with the pencil and paper. Look at this, if you start with, if you initialize, for example, with normal distribution with one percent sort of standard deviation, the activations of your first layer are going to look like this activations of the second.

A

There are gonna be a little bit narrower as you go deeper by the time you get to the sixth layer you're already at 0.05 standard deviation of the activations and as we talked about yesterday, the activation is going to zero that that the gradients would also go to zero, because we are applying the chain rule right back propagation, and this is not a good idea. This is not gonna. Work is you're. Gonna kill the learning very early on okay.

A

What if I start with a larger standard deviation? Maybe that could be a solution, but it turns out that it's not because also the activations and to to also get saturate 3-1 one. This is with a sigmoid activity, the neural network, with the sigmoid activation. You can see that, but also it's just just to illustrate the idea that even going.

A

Problem is saturation and you also get zero gradients actually almost everywhere, so this doesn't work. How do we get around this? There are all sorts of initialization methods that we have came up with about.

A

A

Number of and the number of.

A

Which is you the normal distribution with 1 over square root of the number of input dimensions and, as you can see, it gets thread of the problem of these zero oxidations or narrowing activations. So we get around this and then we can learn. Things are fine, but this works really well with the sigmoid activation. It turns out that it has a problem with raloo networks and networks that have real uu activations you get into the math. You look a little bit at net works with to renew activations.

A

It's actually very easy to follow the Matt and in this paper and I forget the citation for the paper, but it's essentially by caming he and others. They show that for real who activated networks there's a peer initialization tends to also have a problem with with learning so you'll get this sort of error, it just saturates and you actually, you don't get anywhere, so you would need to use the hemming he and most of frameworks. This would be they're called initialization. So you need to look at this.

A

Dn is the number of input connections to the neuron, and this would depend on if you're doing this is, if you're doing a fully connected. This would, but you just the number of input neurons if you're, using a communal kernel. That would be the number there K by K the kernel size by the number of channels. I didn't want to get to all the math here just wanted to give the big picture, but then you see like as soon as you switch to the he initialization. You actually start. You know the network start learning.

A

There was a blog post, I think I saw it. Last week somebody was pointed out that the default initializer in Karis is actually severe initialization. So as soon as you get to a few layers with riilu networks, it's the network starts learning. So you need to pay attention that there are default values for all of these things in your framework, especially carrots, it kind of assumes stuff. So you need to actually pay attention to what is going on, because this can be frustrating.

A

You can spend a lot of time, not realizing that it's assuming some initializer yeah I can see that sorry, yeah, please. This is the I think this is the validation error, but it could be the training error. I'm, not sure there idea is that it actually the updates. Do you stop getting updates to the network at all to the wait, so it stops learning it's just this. They the two.

A

It makes a difference. Yes, yeah I suggest you also look at the at the paper. I think the math is very easy to follow. There I think I, I I'm, not sure, but I I think this happens with like ten layers of network. You immediately start seeing this I.

A

Don't think that this is resonate or anything like that, but I could be wrong. Okay, I think this actually came before resonant the this paper by the same people who came up with resonant okay, so one thing about initialization is that we tend to think okay. That problem has been solved. Okay, things are fine, has been solved since 2010 or something and we're moving on.

A

However, it turns out that initialization is way more tricky than just that, so there is a series of recent papers that explore essentially the effect of initialization on the performance on of the network and the relationship between utilization and generalization, and all of that so I'm gonna. Try to this is a recent paper last year and there's a lot of work, follow up work this year and try to give a very high-level overview of what is going on here.

A

So the basic idea is that when we build these very big networks and train them get a very nice error and then we want to take them and go and apply them in practice, deploy them in practice. We apply. We want something that is smaller.

A

We want a network that is much smaller than that because it's much easier to to say to they're much faster to to to use an inference mode, and also we know that if we look at all the neurons and the network, we know that a lot of those neurons after the training finishes. One of these neurons are not necessary, so they're dead, already and they're not really doing anything. So we apply something called pruning.

A

So essentially we take the original network after full training, and then we prune it, and we get a network like this that performs equally well today to the initial network. So the question that these authors have explored is that why can't we just start with that? Pruned Network and with exactly the same initialization that we received in the first time and see where we get.

A

Can we actually get the same performance and the answer is yes, it turns out that you can get the same performance if you start with the print network and with exactly the same initialization that we had before there is they call it the lottery ticket hypothesis and the idea is that, after you finish your going your ideas that you're initializing your network- and we have these extremely big networks with so many neurons, because we're trying to explore as much initialization as much an exponential number of sub networks essentially and then there is a lottery ticket.

A

There is one of them that actually ends up being the network that that works when you're doing entrance using the full network, and this is an important idea because it could be essentially it's pointing the to the the possibility that it could be that we are building these extremely large networks, because we have very bad initialization and if we figure out how to initialize our networks better, we might be able to build much smaller networks to achieve the same performance on the tasks that we're looking at. There is a recent paper actually I, think yeah.

A

This is June June last last month, where essentially, they explored this the same idea. They they took a network, they took a network, they initialized it and then pruned, and then they looked used, the same initialization to initialize, so many other networks and see, if that initializes, that set of initial parameters generalize and also help us train other networks to get the same performance. I encourage you to look at this paper is very interesting, but more importantly, is that this is an active area of research. It's it's possible.

A

The within the upcoming months to a couple of years that we figure out a way to initialize our networks better, and then we can build smaller networks to perform the same tasks. Yes, I'm, not an expert on that. So I haven't done a lot of that, but my I think most of what they do is they look at the effect of the participation of certain neurons in the final decision on the accuracy of the network.

A

If, if killing a connection does not affect the final performance, then that connection is not necessary, so you essentially remove it yes, so this is exactly what this paper is doing. So we found that within the natural images domain, winning ticket initialization generalize across a variety of datasets, including blah, and then, moreover, winning tickets generated using larger data sets consistently transferred better than those generated using sport datasets.

A

Actually one of the authors is Michaela she's, one of the organizers I think she will be here today, so you can also talk to her about this, so this is this is at least there. The paper. The case study was on image. Classification. Okay, so initialization is not sure, is not simple. You need to think about the default parameters that you have in your network and soon we might it's it's possible. We might find a way to have better initializations for our networks.

A

If we start with good parameters, we might be able to converge much faster to good performing model. Okay, when I move on to talk about a different topic, which is learning great decay. So, if you're, when you think of a gradient descent, we were always like the the picture of gradient descent. Is that we're using these the gradient information to try to get to a point? That is that many, my that, what we call a minimizer right, a point at which the loss function is low.

A

We expect what's a minimizer, at least in the classic picture and the complex optimization. The minimizer is a place where we're essentially the last surface is flat right. So that means the the slope is tends to go to zero, if you so, but remember that we're using stochastic gradient descent. So, even if the gradient itself, the full gradient is, we are getting closer to a minimizer and the full gradient is getting having us. You know getting to zero, so it's decaying itself, but stochastic gradient.

A

Descent has a lot of noise right because of the stochasticity of the batch, the random batch that we're getting taking from one point to another. So we need to. We need to emulate that sort of effect of having a decaying slope, and there are.

A

There is some result that shows that if you want to use SGD, it's sufficient if these conditions are satisfied, so epsilon here is the learning rate and the conditions are the first one says that the sum of the learning rate along all the steps or the sum of the size of the steps along my optimization equals infinity and the sum of the squares of those steps or the magnitudes. The squared magnitudes of those steps is finite.

A

Intuitively the first result says that, if the, if I start from a completely random initialization point, it's a matter of nowhere with the number of steps that I am taking I'm guaranteed, that I can reach the minimizer wherever it is, I have infinite range to get to it right. The second result intuitively says that, if I get close, I will be able to converge to that point. I'm not gonna, just be.

A

You know swinging around that point, so I'll be able to converge to that point. So how do I achieve this in practice with SVD? We do learn in great detail. So essentially we decay what we call decaying the learning rate as we are going on so in practice, you'll hear about something called the learning rate schedule and that's what you will be using and thinking about.

A

There are a lot of different types of learning rate schedules, people use, linear decay or exponential decay or cosine or inverse square root, and this is usually as a function of the step, more often as a function of the epoch. The number of passes that you have through your data, so, for example, linear decay, would be you have a initial learning rate multiplied by 1 minus T, which is the the epoch number divided by the total number of epochs, and you control this by thinking about.

A

What's the final learning rate that you'd have, and you see that this is a decaying function. So the basic idea, which josh also touched on this yesterday, is that the networks that we are looking at we're not trying to get to a global minimum there are. There is a large number of good minimizer's of the entire loss function on that data set, so there's an abundance of minima's that are equally good and there are results showing that those numbers are all actually within good performance.

A

So, whatever you get to whatever minimizer you get to that's good enough, you just don't want to get stuck in an extremely bad one, very early on, so you don't decay too fast.

A

Now, a more practical way of doing this, which is touches on your point, is to actually, instead of trying to a priori, decide how you want to decay your learning rate you can use, you can monitor your loss function or your monitor the performance on some validation data set and only dick reduce the learning rate when you get to when you're stopped learning with the current learning rate. So if the loss here tends to get to almost zero slope, you reduce the learning rate. If you get to another plateau, reduce the learning rate.

A

This is actually the plot of training resonate, and the decay here is by a factor of 10. So if they divide by 10 at each point learning rate, there is in the frameworks that use especially, there are stuff called like reduced learning rate on Plateau. This is a callback that you can add and then you can decide. What's the patient's which patients here would be the number of epochs that you would wait for before you decide to decay the learning rate, so here, for example, this it hasn't been indicated immediately.

A

So you waited more than 15 epochs or 20 a box here or yeah, and then you decayed, so you can apply, put your patients. You can have the minimum learning rates. You don't want to reduce beyond that, and you can also have the decay factor that you can use.

A

I think this is a more practical way of doing the learning rate decay unless until somebody shows that this is a bad idea and I'm sure somebody will so maybe yes yeah, that's actually a very, very important paper and I think we will talk more about that paper on Friday during the scaling training, deep learning models at scale.

A

The the basic gist of that paper is that it turned out that the more complex the data set that we are applying, where we're applying our models on and the more complex the the models themselves, the the larger the batch size that you can actually use and that having a larger batch size helps. You paralyze your the optimization of your model and all those sort of stuff. So- and this is about to mention that another way to do things is I rather than decaying. The learning rate you can.

A

If you look at the relationship between the learning rate and the batch size that you're using remember that we're trying to reduce the learning rate because we're trying to counteract the the the noise we're annealing. Essentially the noise that we have in the stochastic gradient descent and.

A

One way to do this is instead of instead of decaying the learning rate itself. We decide that. Okay, let's just increase the batch size, there's a bunch of papers that explored this idea. One of them is with the very interesting titled on decay. The learning rate increase the batch size and they show exactly this.

A

Instead of lowering decaying the learning rate, they actually show that the king, the learning rate or increasing the batch size you can achieve almost exactly the same loss, the lost curve or training curve, and they even have a hybrid approach where they're decaying the learning rate, while increasing the batch size, they can achieve the same thing. If you do this delicately, you can achieve the same thing and the paper that you mentioned by opening I explorers. This idea further and actually derives a lot of relationships between website learning rate and the optimization progress.

A

That's a good question, so the basic idea is that, if you're, if you want to decay, we're trying to reduce anneal the noise right, so we can decay the learning rate, but if I increase the batch size instead, I can paralyze my process and train faster. So if my batch size I go from batch size, whatever 10 to 100 I can, instead of using a GPU I, can use 10 GPUs at the time and then I finish the training faster.

A

So it's it's about the world wall, clock time finishing the training faster thanks for the question. Other questions. Ok, so now I'll move on to talk about regularization. So remember yesterday we said that if we monitor the training error- and we monitor the validation error, there are multiple regions and we said that we want to get out underfitting as soon as possible, and this this is usually easy to get out up. There are multiple ways of doing this.

A

Can you know so, and then we spend most of our time in this regime where we're trying to essentially push this point along as much as possible, where we're trying to reduce the error on an unseen data set or the validation dataset. We talked about the generalization. So how do we do this? In practice? We use regularization.

A

So this is the last function. This is the the the loss function that we have on a on a mini batch right. This is what we said. We're trying to have essentially trying to fit to have the last function is trying to targeting the data set. You are trying to fit very well to the data set.

A

You could have a label here why, if you are dealing with a supervised learning setup- and this is on a mini batch, so what we do is we add a regularizer to the to the last function, and so the Reg Arizer is usually in this form. So you have some penalty on the norms of the weights, and the job of this is to essentially say like don't fit too well to the training data. Just we want. We don't want to over fit to the training data itself.

A

There are many ways of thinking about this, and this is not special to neural networks or deep learning. So this is in the context of statistical learning in general, and it could you can't think of it as a way of of choosing a simpler hypothesis than an extremely complicated hypothesis that that over fits the the train, they are very well so, a few terms, this lambda is usually the strength. The lambda coefficient is the strength of the regularizer which you want to tune as well as it's a hyper parameter.

A

The weights, the the penalty that you apply, is applied on the way it's not the biases. The rationale is that we have to to a small number of biases. Anyway, we don't really need to regularize those, and then we need the biases to be free, because if there are any shifts in the activations or the data, we want to be able to get those and in practice, if you try to regularize the biases, you tend to under fit your day. Your training data anyway, so types of regularizer is I'm.

A

Gonna go and talk about these and some of these in detail, but just an overview of things.

A

You probably have seen an l1 or l2 regularizer, so l1, essentially, you add to the last function the sum of the absolute value of the weights, and this tends to create sparse representations, which assumes that if you, if you have reasons to believe that your activations should be sparse or if you tried it and it turns to work out well then this is the thing to use you. Can it's easy to think of why this creates sparse, no sparse representations?

A

The penalty here is on the size of the weights themselves right whether the weight is 1 or 10 its penalized. If it's not 0 its penalized, so it actually is trying to force the network to learn a sparse representation only have a nonzero weight. If you have reasons to believe that this is helping the optimization right, another type of regularization, which you see this more often sometimes it's called the weight decay, as you add, to the loss, function essentially the norm of up your weights, and this is a different type of your regular.

A

It says only like don't have too large weights right. This is, if it's small, it's not. The penalty is not that strong, but if W is very high, the penalty is strong. There are a lot of connections to Bayesian optimization. If, essentially, it could act. If you can think of this as a Gaussian prior on the weights, you have reason to believe that your weights are a Gaussian or normally distributed. So the log of a Gaussian gives you W squared this.

A

In practice, we tend to use L to a lot other types of regularization things like noise robustness. You want your network to not be too sensitive to the size of a tutu to the way the value of the weights themselves. So you add some noise to the way. It's true, you want the decision to be independent of small perturbations.

A

The ways that you have this is another type of regularization I'll talk about every stopping dropout normalization, adversarial training is something that we're not going to get into, but you have seen a glimpse of this in Josh's talk yesterday, and it's also a form of regularization that that tries to counteract this problem about your serial examples, and you will see more, we have been seeing more and more of this in practice that which, for many reasons for improving the generalization of our networks and also for counteracting this problem of adversarial attacks for security reasons, I'm sure you will hear more about this during the week.

A

Okay, so yes early stopping we mentioned yesterday that early stopping is something that you want to apply all the time. So essentially you monitor your validation error and, as soon as your validation error starts, climbing that's where you want to stop, because this is where your model has the best performance, and this is it's a type of rigor ization. You can see if you look at the angered fellows book.

A

You'll see a connection between early stopping under in some simplified setup, early, stopping and l2 regularization, so essentially early stopping the way that you can think about it. Like l2, says: I don't wander to too far off from the initial parameters yeah. So this is something that a type of regularization they use in practice dropout. You have seen this in the network's yesterday. The basic idea of dropout is the following: we have. This is a fully connected Network, and this is what we are training there is.

A

The the idea of dropout is to randomly drop connections while you're doing the training of your network, and the basic idea is that maybe, if I randomly drop these connections, I will discourage the network from having a cohabitation some neurons fire only when the other neurons are firing. I can also make again. These are all intuitive pictures that are helpful, but they break sometimes so another way of thinking about it. Is that your forcing the network to not to rely too much on certain representations to be able to make its decision right?

A

So you're forcing it to rely on multiple sort of representations to be able to reach the same same decision, and what you do is that in at inference time you use the full network and then you can look at the matter a little bit and then you will. You can Reba rebalance or renormalize the output so that you essentially cancel out the property, the this probability of dropping the network connections.

A

Another way to think about this is that you're continuing your training instead of one network, you're training, a large example an exponentially large and sample of networks right when I have random, dropping out connections. All the way through my network I have a random sample every time, I'm training a different sub network.

A

In practice this tends to work really well. So you see with this is the last function of training, a certain network without dropouts. This is what dropout you can see, how it improves the it reduces the classification error. Where do you insert dropout? Yes, yeah? That's that's exactly how you should do it so essentially, when you want to you, want to average so you're training, these sub networks right, but at the inference time you want to average the decision of this in Samba love sub networks right.

A

So you need to cancel out that dropout probability that you have so, if you're dropping out to probability 30%, you want to actually rebalance the output and then I didn't put the math here, because I wanted to get through more slides. But you want to pay. Actually that's a good point.

A

You want to pay attention to that when you're doing when you're building your network, that sometimes you would get different results and when you have dropout and if you don't pay attention to what your you're doing or you're using the dropout in inference in training mode, you will get a different result that when you're using it in training mode- and this is essentially for the three bars the probability yeah. Thank you it. It could very well be that if you train a network without dropout, you will get more more neurons would die.

A

But if you train it with dropout, you will get more neurons, who would actually stay alive and then your prune network would be smaller. It could be very wealthy, but yeah. So I I see your point. They might. There might be a connection. That's a good question. So this was my next slide, where I do think. That dropout is not necessary and there in between the convolutional layers, because there are we're trying to learn filters and feature extractors and there usually have much less number of parameters than the dense layers.

A

So I think that it doesn't make a lot of sense to do drop, to apply drop out there and I put here an example of one possible way of putting dropout is look only at the classifier or the last dense layers and put some dropout in there. That's it. It might be that if you put dropout in your network, it might perform better I haven't seen a lot of results on that.

A

Is this idea clear that, where to put the dropout not clear, okay, so another form of regularization is data augmentation. So we mentioned yesterday that the best way to improve the performance of your network is to collect more data right. If you can do that, that's awesome. If you can't do that or even if you can do that, you might still want to also apply some sort of data augmentation when we are applying, let's think of the context of object, recognition when we are applying neural networks or confusion, your networks on object, recognition.

A

We know that the decision of the network should be independent or invariant of the orientation, for example, of the objects or the color, the you of the objects or or whatever sort of transformation mirroring the the objects we should get. The cat should still be the same tax right, so one way of forcing the network to learn that is to augment the data set by applying these transformations randomly on the original data set, as I am training. Of course you don't want to apply that during the test or validation, but during training.

A

This is what you do, and it tends. This is something that we do a lot in practice and it improves the performance of the network of all of models. You want to make sure that whatever transformations that you're applying actually make sense for your networks increase the size of the training dataset- oh you can't see it. This is a Z.

A

Okay, sorry thanks thanks, okay, so so this. This is something that you can do. If you're looking at the TF data data set this data ingestion pipeline that comes with tensorflow, we usually do these transformations on-the-fly in that pipeline.

A

If you have a very large set of you're, doing a training of an extremely large model, and you have a lot of workers doing the data shuffling and the data ingestion and all of that sort of stuff, you also want to get them busy doing data augmentation, okay, another type of rigor ization this time is different, is a little bit different.

A

It's there are a lot of things that we do to help the performer to help them to make the optimization process easier, while we're training, neural networks and some of them they end up being implicit, regularizer x'. There is some contention about the word implicit here, but they are not express it as and they're, not just added to the last function. One of them is called batch normalization.

A

The essential idea of nationalization is that look at this Network, you have a hidden layer. Hidden layer gives some activations the activations go into the neck they're. The the main idea was that, if, if I updating to the both layers, at the same time, this layer is being updated to respond to the to the current gradients, but by the time once I update this this layer I'm also updating the previous layer.

A

Once I update the previous layer, the activation of the Prius previous layer are shifting, so once they shift, then this has to relearn how to respond to the new activations. This picture is kind of the the basic idea that people you know I've been thinking about when they came up with bachelor ization. There are a lot of reasons or recent work to believe that that is not exactly how bash normalization helps. I'll talk a little bit about this, but that's that was the initial idea you can think about it.

A

From a different point of view, we talked about normalizing the input to the entire network. The batch normalization tries to normalize the activations themselves, so every one of these layers receive the same distribution of data rather than just the very first layer. We received some some normalized data. The way that you do this. This is directly from the original bash. Normalization paper is that you take the sum along the batch and then you, you may get the mean along the batch. You can.

A

The standard deviation along the batch and then you normalize, the activations of the layer by subtracting the mean and dividing by the standard, deviation plus an epsilon for numerical stability, and this is great, this normalizes the outputs of every layer. However, what, if the that you know, severely restricts the capacity of the network and the capacity doesn't really want completely normalized distribution, so you allow for those shifts by having learn about parameters gamma and beta, so to rescaling.

A

The essentially the activations by gamma and beta and gamma beta are learnable parameters, so you actually update those with gradient descent. I think it will depend on the problem that you're looking at a lot of people. They do augmentation by doing random crops and then generating texture around the random crops, and they that works for the problem that they're looking at yeah.

A

If it doesn't make sense for your day, you should definitely you're teaching the the Euro model, the wrong thing right, you want it to in your case, you want it to make sense of the order. You actually want it to use that information as a clue to what is going on if you're, removing that and you're removing an important feature from your data, you want don't want to break essential features in your new data or essential correlations.

A

Okay, so batch normalization, so in practice it helps with a lot of things. First of all, you remember the problem of vanishing gradients that we talked about when we build deeper networks. Essentially, we have when we multiply gradients along the layers, the the gradients tend to vanish by the time they come to the very early part of the network. We'll talk about this in a little bit more details, but the important thing here is that batch normalization, just because the distributions now are normalized.

A

So, instead of the continuously narrowing distributions, the distributions have some standard deviation, they're fatter, and so it tends to help the gradient flow all the way to the earlier layers. So this means that it's easier to train deeper networks with batch normalization.

A

It also empirically it seems to be that Bachelor ization makes the network less sensitive to some hyper parameters, so we can use higher learning grace and that leads to faster conversions, we're taking larger steps right as we're doing the gradient descent, and it also tends to make the networks less sensitive to the initialization. You don't have to do so. Many restarts to find a good initialization for the network. I talked about this idea briefly of the shifts of the distributions between the layers, and that was the issue of motivation of rationalization.

A

However, there is a work from end of last year actually met here, but it was a new Europe's last year which shows that the idea of shifting distributions is not over. Shifting covariant shifts of distribution might not be accurate, they show that empirically and then they try to argue that bachelor ization accelerates the optimisation process by making the landscape essentially smoother.

A

So you can actually look at the so they look at different measures of essentially the smoothness of the gradients and how much noise there is, and they show that the landscape is a bit smoother when you use batch normalization I'm, not sure if there is an update on this picture, but that's one of the ideas. One thing to remember two things to remember the first one is that batch normalization is an implicit trigger Iser, so it does affect the capacity of your network. I'm, not I.

A

Don't think that there is an explicit way of trying to measure the impact of a batch normalization on the capacity of the network, but it does. The other thing to remember is that bachelor ization behaves differently during training and test time, so during training we're using the batch statistics, the mean and the variance of the batch to do the normalization, but during test time we don't want to use that, because the test dataset might have different statistics than what we trained on.

A

So what we tend to do is to accumulate running averages of the of the of the training batch and and standard deviation and use that during test time, this tends to be one of the very common bugs that you have when you're dealing with personalization looking things looking at code batch normalization normalizes along the batch dimension. So it's a long if you're looking at this is your very fat four dimensional tensor input to the network. This n is the number of examples that you have in your mini batch batch normalization normalizes along the batch dimension.

A

There are all sorts of other types of normalizations that don't normalize along the batch dimension, layer norm, instance, norm and group norm. You can see more of this in this paper. Could normalize paper the basic idea, these other types of normalization do not depend on the batch and that's nice, because we have the same normalization during training and test time and also helps when you're doing distributed. Training thurston might talk about this on Friday, so batch normalization is for the activations in the network and data normalization is for the data itself.

A

You don't apply, yeah, that's a good question, so the mini batch is related more than this. It's related to stochastic gradient descent right. We mentioned yesterday that the noise in the stochastic gradient descent tends to help. So most of the time using a small small batch size gives you a model that generalizes well.

A

However, it could be so the practice is that not to use batch size larger than 32, but it could be that the learning process is extremely slow with that batch size because you're, you have to take a small learning rate, there's a very large noise. So to get beyond that, you want to to increase the batch size beyond that. There are all sorts of things that you want to look at and I think source and we'll go through this on Friday, but generally smaller batch size gives you better generalizability larger batch size.

A

You that's what you want to do, because you want to train your model faster, but there are caveats on this. I think I've seen things like this on Stack Overflow or something but I'm, not sure I'm, not I, didn't do any experiments. Myself and I haven't read any concrete, like paper, solid papers on this topic, but you can think of one thing. Batch normalization is applied in the layers in between the convolutional layers or after, like a block of convolutional layers.

A

Dropout is applied in the classifier on the dense layers, so they're in different domains, they're not usually applied after each other, at least in the networks that I have looked at so they're in different parts of the network. It really depends you can. There are people who swear by before the activation or the activation?

A

The general guidance is to use it before the activation, because you don't wanna, you want to have the full statistics of your badge before you apply raloo and kill or the negative parts of your distribution, but there are people who show that it works better when it's applied absolute activation, if you're, if you're dealing with the data, where, with a problem where you have someone else, has, has a network, and it shows that that network work work well, you might want to just look at follow their example.

A

If that works for you, okay, so another thing that improves the performance of your network is the idea of in Samba 'ls. So the basic idea is that, instead of using one version of our model at inference, time is to use multiple versions of them in Samba, with models and then average the results of those example of models it tends to to give about two percent extra performance in practice.

A

This is an empirical result, and so it's easy to do this when you have a shallow learning sort of model of traditional machine learning to actually train multiple versions of your model and then average the prediction, but it could be really expensive in deep learning specially if you're, building a very big model. So another way to do it is to use multiple snapshots of the same model during the training process.

A

So you can do this in various ways, but save different checkpoints after you get to the region where you start getting satisfactory performance, you can save multiple checkpoints of your same model and use an in samba love. Checkpoints to do the averaging another way is to keep a moving average of the parameters of the actual network parameters called Puglia averaging yeah, it's another way of doing in Sandow. This is, if you're, applying networks and practice you really wanna. You might want to test this. It gives two percent: it's not trivial.

A

Okay, so now, I want to move on to in the last 20 minutes to the idea of depth. Another issue that we have. So if you look at the winners of the imagenet competition in the past since 2012, you will see a common pattern that the number of layers that are used in the winner's network they're all deep learning networks anyway. Since then, the number of layers has been increasing right. So 2012 was the first winner.

A

Alex net was about eight layer or eight layers, and then by 2015 we have a hundred and fifty-two layer resinate, so it seems that deeper networks tend to perform better.

A

This idea is related to the two. What we talked about yesterday, that at least when, right now, when we look at these, what these networks are learning, they tend to learn this hierarchy of features or hierarchy of filters that, when the very early layers, they learn edges and blobs and very simple motifs. And then, as you go to deeper and deeper sort of models, you can start building more. They start to activate on more abstract sort of concepts.

A

So it seems that building deeper networks it tends to encourage the network to build another longer hierarchy of filters. That said, take all of that with a grain of salt. This is again on hand, wavy sort of arguments, but in private the important result is that deeper networks tend to perform better in practice.

A

You can see this with simple examples, so maybe two plots is so you can, for example, this is a test of the performance of a network as the number of layers increases, and you can see that as the number of layers increase, you can get better and better performance. If you don't believe this, you want to look at, maybe because the number of parameters is increasing. This Y is performing better /.

A

That's not true, because if you look at the accuracy on the y-axis versus the number of parameters with, maybe you can look at only at the blue and red so at the same number of parameters about 200 something million about 200 million parameters. If you put them in 11 layers, you tend to get 2 percent or more than 2 percent, better performance than if you put them in 3 layers.

A

So you can do these sort of experiments and see that with the same number of parameters, if you reconfigure them in a deeper network that the networks tend to perform better. Now, that's the result, but in practice training deeper networks tend to be more challenging. You can look to the left here. These are the training errors of training at 56, layer, Network and then at 20 layer network.

A

So the training error that the 20 layer network tends to get much better performance than the 56 Network, and if you think this is overfitting, it's not actually true, because even on the validation, you can see the same thing. So one of the reasons that training deep networks is difficult is what we mentioned yesterday is the vanishing gradients that we have to use back propagation and then we're multiplying a very long chain of of gradients if any of the activations along the weight and like the distributions get narrower and narrower.

A

So that seems to kill the gradient and the gradient flow back to the early layers, and sometimes you also get exploding gradients in recurrent networks. But this idea of making essentially making that information or the gradient updates travel all the way back to early layers is extremely important to be able to train deeper networks.

A

Some people have thought about this and they asked the following question: if I have a network with this number of layers, I, don't know how many layers there are here if the IVA network, with these this number of layers- and it performs well, why, if I, add layers in these gaps, it doesn't perform equally well in principle, the network should be able to learn just an identity, mapping right, just okay, make these identity mapping and then and just learn, get the same performance.

A

However, the optimization process becomes difficult, which is a part of it is the vanishing gradient that we talked about earlier, so the idea that they got is to use something called the gradient highway. There was a paper called the gradient highway before, and this is a follow up. Paper called the residual residual net, which is ResNet that we talked about. They come up with this idea.

A

Why don't we have the identity mapping as a part of the construction of the actual network, so instead of just having layers and they are stacked after each other, try to learn instead of just trying to learn the output of a block, we try to learn the residual of of an output. So what we are trying to get is f of X, which is the output of this function, plus X.

A

So this function will not try to pass the full X and learn some useful features from X, which is what we really want, which is H over X. It will only try to learn that residual, which is f of X, and this way, all of a sudden. We have a highway for the gradient to flow right, so the gradient can flow from the loss function, which is here all the way back to much earlier layers, and we can start immediately. People started training, hundreds to thousands of layers network, just sidestepping that vanishing gradient problem.

A

Yes, that's a that's a very good question, so I think in this. It was an element-wise addition, but there is closely related constant. There are two closely related concepts, which is kept connections and residual blocks, kept connections, I think one of them I can't remember it actually, the terminology which one but the one of them concatenates the filters. And then, if you have yeah, you concatenates the filters.

A

So that's great if you want to add them to each other, like you're doing here in ResNet, you probably want to use one-by-one convolutions to to match the number of filters of the previous layer to the next layer and then do the addition. Thank you, okay. So what happened when this?

A

When this paper came out, they won all competition. All you know: subproblems and imagenet competition in 2015 by an extremely large margin, image, net detection, 16 percent better than second second winner, 27 percent on object, localization and then other competitions, cocoa and cocoa segmentation, 11 percent 12 percent. This is ridiculous. Okay. So this the idea here in 2015, you know this network, the already the image net classification with ResNet. We achieve better than human error, so this ResNet performs better in on an unseen data set than a bunch of humans.

A

Okay, the images are not there there, there are 128, there are lower resolution pixels and they are taken and real with real lightning and real all sort of real-life conditions that it can actually be confusing to people I'm, not sure how exactly the human test is performed. But if there's time you know constraint or something, but there is a there- is an irreducible human error here.

A

Okay, so I don't have time to actually talk about this I just wanted to mention that you might think that the idea of this idea of skipped connections or hot gradient highways has actually enabled.

A

Has resolved the entire problem of vanishing gradients, but there are, but there is another way to look at it. It has actually shortened the affective pathway from the last function, all the way to early layers. If you are interested in this, you can look at this paper. So essentially you can see that the affective pathway from the very early on to from the last function to two early layers is about nineteen layers. Instead of the entire stack of layers that you have I want to touch on a couple of topics.

A

Actually, maybe one topic before we finish a transfer learning. Unfortunately, we don't have a talk on transfer learning for reasons the speaker couldn't make it, and this is a very important topic in practice. When you're training neural networks, there is one you always we don't always have millions of images or millions of data label. Data sets to be able to train, resonate from scratch, and you also don't have the time or the computational resources to be able to do this from scratch.

A

Training, ResNet I think anima genetics on a single good single GPU takes about ten days so and if you're doing this and practice you'll never get that paper out. So.

A

So we use something called the transfer learning and there is a closed concept to this, which is called also domain adaptation that you might be a training on on a certain data set, but in reality we are applying your network on a slightly different domain or slightly different data set than the the training data set. So how do you actually deal with that problem of domain adaptation? It's a closely related topic and it's very important in practice.

A

So just a reminder of what Brenda showed yesterday, the difference between a machine, learning and deep learning is that in a traditional machine learning we can think of it. This is one of the differences.

A

You know everything that is there to say about this is that in traditional machine learning we do feature extraction by hand, and then we apply a classifier network or a classifier, whatever model SVM or it could be a shallow neural network, whatever it is, and then we get our output in deep learning. We're doing this end to end. So our neural network does both the feature extraction and the classification. So all of that those convolutional layers that we had in our network, those are feature extractors and then the dense layers are the classifiers.

A

So the idea of transfer learning is that, when you're going from, you have a you, have you want to train on a small data set or a slightly big data set? But it's not large enough to train the entire resonate or entire vgg network or something is to reuse those feature extractors. So essentially you take you get a pre trained, neural network. You keep the convolutional layers, because these are feature extractors, you think they're useful features. They build hierarchy of concepts and abstract stuff, and then you retrain the the classifier.

A

You retrain the last layers, which is these three layers. If you don't have a lot of data, you could train only the very last one. If you have a little bit more data, you can train two of these. You can might be able to train the three of them. If you have a larger data set using pre trained network, you can also fine-tune your. You can also fine-tune your feature extractors, which is the the convolutional layers.

A

If you want- and you have data enough to do that- and this is very useful in practice, in fact, there's this slide in CS 231. This is the transfer. Learning is pervasive. It's actually the norm, not the exception. In reality, you don't want. You have an idea, you're sitting down with your friend, you want to test it. You know, after lunch you and I go and test it. You're, not gonna. Just you know, spend ten days to test us a very simple idea right.

A

So what you want to do is to you want to get results by the end of the day right so you're going, you use, transfer learning, and this is done everywhere in practice. This is what people do and they encourage the framework. You look at that yesterday. It's very easy to do this. It's just one line to get a pre trained Network any of these networks. You can get them in one line and then removing parts of of the network's is also super easy in practice.

A

There is a paper that showed last year, that is this extremely necessary, and the answer is no. If you have enough data set, you don't really need to get to the same accuracy that you would get with a pre trained network. You don't need to to use transfer learning. So essentially these the two curves here. This is the the accuracy of a ma of resident model that the two curves are the in magenta is. The random initialization in gray, is with retraining pre training, which is the transfer learning you can see.

A

If you have enough time you're you know have nothing to do and then you have enough data set. You will get the same accuracy. However, if you don't have enough time, the pre train network tends to get to a much to a better accuracy, much faster than they randomly initialized one. And if you don't have data set, you don't have an option anyway, you have to use pre transfer, learning, okay, yeah, so the great curve I, actually don't. Remember- I, read this paper last year, end of last year. So but that's a good question.

A

I I think they retrain only the last dense layer and I. Think in resonate in this particular resonate. There's only one last dance layer and they don't they don't fine-tune the convolution layers or it could be that whatever batch size is still much less than the maximum batch size that you can use. This is related to the paper that was mentioned this morning.

A

The opening I that, as the as the training goes on and you get to smoother parts of the loss function, you can use much much larger batch sizes as long as there is a ceiling to that where they do some derivations the maximum batch size, but it tends to be very, very high ceiling. Oh that's the learning rate decay, that's a good point, so you remember we were talking about the learning rate decay very early on. So these are the points where you decay the learning rate, and then you jump.

A

So these jumps are the same as these jumps I think this is the learning rate decay I know because this this curve here yeah, you know it's yeah I need to relook at the paper. I.

A

Think I vaguely remember that if this these are different places where you could decide to stop the training and then you do fine tuning some fine tuning schedule. So these are different possibilities for how to where to stop the training and doing some fine tuning of the entire network.

A

Sorry I shouldn't present papers that I have read six months ago, so, okay, so two other topics that are important in practice: hyper parameter, optimization we've talked we mentioned, so many hyper parameters that we have in these neural networks. And you might you want a way to do that type of parameter, optimization, there's a talk tomorrow by Ben, Albrecht, I, think he's here and then please, hopefully, you'll be there for that.

A

You'll talk about the difference between doing great search and random search, and also maybe some other, like optimization methods like Bayesian, optimization other methods, and you will also show how to do that with a particular framework. There are frameworks to do this. You might want to look at when you're doing. If you don't have a lot of time to do a parameter optimization. There are a few parameters that are extremely important to tune. The first one is the learning rate, especially if you're not using other.

A

If you're not using a dime, you probably is less sensitive to the exact value of the learning rate, but if you're not using Adam, you prob want to do at least learning great tuning, and you want to do that using random search rather than grid search, because this is wasteful. You will hear more about this tomorrow, a couple of training tips. We talked about a lot of, we talked about things like the initialization and how the initialization can can prevent the learning.

A

If you have the wrong initialization, and there is the way that we at least Illustrated, that we showed the distributions of the different activations, it turns out that in practice, when you're actually doing debugging your model and trying to find out why it's not working, this is a very good way of actually finding out. You know. Are things at least? Is there a learning impediment somewhere?

A

Is there some a lot of 0 activations if you're using Renu, for example, it could be that the network needs to have some leaky raloo to to pass back the gradients. So the way to actually find out about all of these problems is to monitor the activations activations distributions.

A

Yeah I I didn't have time to actually find plots for this. So sorry, another thing that you might want to look at is to watch the update scales for your weights. So you can. This is not correct. This is just the monitor the update scales, which is the gradients divided by the weights. So you want your gradients divided by the weights to be somewhere between one to one part in a thousand to one percent of the way.

A

This is just a rule, but you really want to see that your they, the updates, are not much larger or extremely large that they are overwhelming the weights and then supporting the whole thing off. That's another thing that is very useful to monitor in practice. Another way is towards the end of your training. You want to see if your network- this is good enough, and this has worked one way of fine.

A

You know inspecting the quality of the final network that you have is to visualize the weights of the first layer. If your weights are like this, their crisp, clear, they're very there, it's very clear that they have very nice edges their activators and filters here. This is where you want to be. If you have, if you have this type of you know visuals or of your or your first layer weights, this could be an indication that you either haven't convert or there might be problem with your weight regularization.

A

So that's another way to inspect problems in your models. There are a lot of. There are a lot of yep. Okay, there are a lot of such you know, tips that I think they're, really really useful to look at under Carpathia has compiled a large number of them. Last April and I encourage you to look at this, this blog post and actually think about each one of them. Why it makes sense to do that? Why? Why is it useful to do that?

A

You will I mean you have to equip yourself with as many of these diagnostic tools as possible to be able to debug these networks faster. That's at carrying questions.