From YouTube: 02 - Introduction to Neural Networks I - Mustafa Mustafa
Description
Deep Learning for Science School 2019 - Lawrence Berkeley National Lab
Agenda and talk slides are available at: https://dl4sci-school.lbl.gov/agenda
Again, Brenda's done a great job with setting the stage on the general taxonomy of methods and solutions in machine learning, but obviously the workshop is around deep learning for science, so most of us are really going to take a deep dive on deep learning methods. Mustafa is a machine learning engineer in the Data and Analytics Services group at NERSC and, as I mentioned earlier, he's really the mastermind behind the deep learning summer school. So really, any feedback that you have.
The second talk is in tomorrow's morning session; by then you will be deep in the trenches. Joel will talk about training very large-scale deep learning models on large-scale data sets, and also about the practical questions and concerns that arise. So it's a steep curve from here until tomorrow morning, essentially to the end of the morning session, and then after that most of the topics will be essentially a tour de force through the methods and the applications where people have applied deep learning, so it will border somewhat between applications and also the latest research.

During my talk, please, if you have any question, just raise your hand and ask. I'll try to repeat the question; we won't be having a chance to pass a microphone around during the talk, but please interrupt me and ask any question you have. Okay. So before I go on with my talk: you can come here and attend a week of talks and lectures, or you can take an online class. There is no shortage of online classes for teaching deep learning.

However, I really think that if you want to understand all the intricacies of everything that goes into doing deep learning in practice, you really need a solid undergraduate-level course in deep learning. There is one such course online, and I think if you have done deep learning you probably know this course: it's the Stanford CS231n course. The videos are from spring 2017; the lecture slides have been updated for spring 2018, I think, but the videos are great.

So if you really want to do deep learning for a living, take some time to actually go through these lectures, and probably the homeworks as well. There are about fourteen lectures; that's fewer lectures than you will take during this week, so it's not really a lot of work. While making my slides for today and tomorrow, I actually made great use of these lectures; you will see a lot of snapshots from them. I also made good use of the deep learning books.
B
These
are
two
excellent,
deep
learning
books
that
you
probably
have
seen
these
before
and
there
are,
but
there
are
many
others.
Another
thing
is
engaging
with
research,
I
think
there
is,
you
know
if
you
want
to
do
deep
learning.
Nowadays,
you
really
have
to
be
up-to-date
on
a
lot
of
the
research
that
is,
that
is
happening,
especially
if
you're
using
a
cutting-edge
sort
of
model.
I
do
remember
that
we
ever
talked
in
November
last
year
about
the
scaling,
training,
deep
learning
scaling
at
scale
and
by
January.
B
We
were
about
to
give
another
talk
and
we
had
to
update
the
slides
with
the
latest
research
that
happened
in
December.
So
a
lot
of
stuff
are
happening,
especially
if
you're
working
on
cutting-edge
methods,
the
still
dot
up
is,
is
a
it's
essentially
a
journey.
The
pedagogical
journal,
where
they
try
to
to
essentially
expose
a
lot
of
important
machine
learning
and
deep
learning
concepts.
B
Some
of
them
are
latest
research
sort
of
concepts,
and
some
of
them
are
fundamental
to
everything
that
we're
doing
like,
for
example,
why
does
gradient
descent
with
momentum
work
so
to
make
sure
to
check
out
this
when
you
have
some
time?
Okay,
so
that
aside,
the
talk
for
today
I'll
try
to
essentially
talk
about
the
neural
network
and
neural
networks-
basics,
ok,
trying
to
find
this
new
networks-
basics
Brenda-
did
talk
about
this
I
want
to
go
through
those
basics.
B
Again,
after
that,
we
will
talk
about
how
do
we
optimize
these
neural
networks
and
how
essentially,
how
do
we?
How
do
we
construct
this
problem
of
optimizing,
a
neural
network
and
then
how
do
we
actually
find
those
parameters
of
the
neural
networks
we'll
talk
about
in
practice?
What
we
do
to
monitor
that
the
training
or
the
learning
process
of
those
networks
and
then
at
the
end,
we'll
get
into
convolutional
neural
networks,
basics
I'll!
Try
to
keep
this
talk
at
a
conceptual
level,
so
you
won't
see
a
lot
of
math.
You won't see a lot of practical tips and nitty-gritty details of actually making these things work; we'll defer those to tomorrow morning. It's inevitable that there will be one equation or another.

So if you look back at the history of neural networks, you immediately realize that a lot of the terms we're using right now did appear before, right? We talked about perceptrons: that's the nineteen fifties. You have probably seen backpropagation somewhere: that's the nineteen seventies. LSTMs are from the 1990s. So people have been working on this for a while; a lot of the technologies that we use right now are not new, right? However, this explosion of results from applying deep learning, and of successes in deep learning, has only happened recently. As Brenda mentioned, there are factors for why this has happened.

First of all, we do have data. We have a lot more data than we had before. Something that distinguishes the performance of shallow learning methods, the ones that Brenda talked about, like SVMs, clustering methods and all those sorts of models, from deep learning is that shallow learning methods tend to plateau: their performance tends to plateau after a certain amount of data, and beyond that it becomes very expensive to evaluate them and train them. For example, in clustering you might have N-squared sorts of algorithms, and they also tend to plateau in performance. Deep learning is very data hungry. You will see tomorrow in Joel's talk that deep learning models tend to have a power-law dependence of performance on the amount of data that you train on, and they continue all the way until an irreducible error, where you can't get more performance than that.
B
So
the
available
availability
of
such
datasets
to
actually
build
models
on
is
an
essential
component
of
why
deep
learning
now
has
worked.
The
other
thing
is
being
able
to
calculate
all
of
these
big
matrices
very
quickly,
and
that
has
happened
thanks
to
all
the
gamers
by
GPUs
right.
If
you
look
at
the
a
plot
like
this,
for
example,
this
is
the
error
of
the
winning
imagenet
competition,
winning
algorithm
and,
and
then
the
number
of
users
in
blue
the
error
is
in
is
in
red.
B
You
see
that
2011
was
the
last
time,
then
that
non
deep
learning
methods
was
was
used,
and
that
was
the
era
was
about
26%,
deep
learning,
the
first
time
that
it
won
this
competition.
It
reduced
the
error
from
about
by
about
10%
from
26%
to
16%.
You
see
that
this
was
a
tremendous
jump
right,
and
that
was
the
very
first
time
the
GPUs
were
used
for
such
algorithms.
B
Yes,
this
is
the
error
rate
of
the
winning
algorithm
in
the
competition
so
being
able
to
calculate
to
build
bigger,
neural
networks,
bigger
functions
and
optimize
them
is
an
essential
component
and
why
this
works
now
and
last
is
that
that
a
succession
of
algorithms
that
we
have
seen
on
the
on
the
previous
plot
has
finally
actually
has
finally
worked.
So
people
were
thinking
of
better
optimizers,
better
regularizer,
third
normalization
methods,
all
of
these
algorithms.
Without
them
things
don't
work,
and
you
will
see
tomorrow.
With that said, I want to get into talking more about deep learning. This is the long story short; this is essentially everything that we'll be talking about today. What deep learning is, is a family of parametric, nonlinear and hierarchical representation learning functions, so they try to learn representations. The way that we optimize them: they're massively optimized with stochastic gradient descent, and their objective is to encode domain knowledge. How do I look at the data and try to learn a certain task from this data set? Of course, the domain knowledge can be a variety of things, like domain invariances, stationarity, and a lot of other stuff. We'll try to decode this statement in this talk.

So, neural network basics. As Brenda mentioned this morning, what we try to do with neural networks, generally, is to build models that try to approximate relationships that we have in the data, right? We make an assumption that there is some relationship between an input X, or an observation X, and an output Y. Y can be a label, can be an action, can be whatever sort of thing you want to associate with the X. We're making the assumption that this relationship exists, and then we're trying to learn what that relationship is. We're trying to make the model essentially learn that relationship, so that we can take that model and then apply it in real life, right?

The simplest way of doing this is, instead of trying to find the exact right answer for what that function can be, to think of it in a simpler way, right? We can break that function down into atomic functions. We can think about what those atomic functions could be and then try to essentially build a hierarchy of these functions, all the way from the input to Y, and we try to optimize to find the parameters of those atomic functions. The simplest such atomic function is an affine transformation. The simplest function you can think of is a linear function, right: you take an input X, multiply it by a bunch of parameters, and that's it, that's your output.
B
We
add
the
bias
here
because
we
need
like
what,
if
the
input
X
is
not
centered
around
zero
right,
so
you
need
the
bias,
but
this
is
so.
This
is
linear
with
the
biases
and
affine
function.
We
stack
these
functions
so
the
the
output
of
the
first
hidden
layer
of
the
output
of
the
first
operation.
We
call
it
the
hidden
layer.
One
goes
into
the
next
one.
The
output
of
the
next
one
goes
into
the
next
one,
all
the
way
until
the
output.
B
However,
if
we
do
this,
if
we
just
stack
a
bunch
of
linear
functions
after
each
other,
the
global
result,
the
final
result
will
be
a
very
big
linear
function.
Right
so
and
that's
not
very
useful.
We
are
trying
to
learn
very
complex
relationship
between
X
and
the
output
Y,
so
we
do
have
to
have
some
non-linearity
in
there.
So
what
we
do
is
we
take
the
output
of
the
first
layer.
B
We
pass
it
through
a
non-linearity,
and
then
we
call
that
the
activation
I'll
talk
a
little
bit
a
lot
about
that,
but
that
will
be
your
output
of
the
first
day
or
at
the
first
hidden
layer
and
then
that
one
goes
into
the
next
one.
So
essentially
we
have
atomic
functions.
We
pass
the
outputs
by
atomic
conscience
into
some
non-linearity
and
then
we
build
a
hierarchy
of
such
operations
and
that's
what
we
call
a
neural
network.
So
in
this
there
are
a
few
terminology
here
to
remember
this
is
the
input
layer
called
an
input
layer.
B
B
Each
hidden
layer
has
a
bunch
of
parameters,
these
are
weights
and
they
also
have,
and
then
there
is
an
output,
their
activations.
So
the
idea
of
activations
is
essentially
that
idea
of
having
a
non-linearity
and,
and
then
your
network
and
the
way
that
it
works,
is
that,
as
you
saw
before,
we
have
a
number
of
features,
those
number
of
features
they
are
weighted.
We
calculate
their
weighted
sum,
which
is
essentially
by
multiplying
them
by
W.
We
add
a
bias.
We
pass
that
through
an
activation
function
and
we
call
that
the
output
of
the
activation
function.
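To make that concrete, here is a minimal NumPy sketch of the forward pass of a small fully connected network: each layer is an affine transform followed by a nonlinearity. The layer sizes and the choice of ReLU here are illustrative assumptions, not something from the slides.

```python
import numpy as np

def relu(z):
    # elementwise nonlinearity applied to the pre-activation
    return np.maximum(0.0, z)

def forward(x, params):
    """Forward pass of a small fully connected network.

    Each layer computes an affine transform W @ h + b (the pre-activation)
    and then applies the activation function to get the layer's output.
    """
    W1, b1, W2, b2, W3, b3 = params
    h1 = relu(W1 @ x + b1)      # first hidden layer activation
    h2 = relu(W2 @ h1 + b2)     # second hidden layer activation
    y = W3 @ h2 + b3            # linear output layer
    return y

# Illustrative sizes: 4 input features, two hidden layers of 8 units, 1 output.
rng = np.random.default_rng(0)
params = (rng.normal(size=(8, 4)), np.zeros(8),
          rng.normal(size=(8, 8)), np.zeros(8),
          rng.normal(size=(1, 8)), np.zeros(1))
print(forward(rng.normal(size=4), params))
```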
This is where the analogy to real neurons comes in. The idea is that you have dendrites; those dendrites collect signals from different places, and then the neuron decides whether to fire or not, and then you have the output signal, which goes into other neurons. A bit of terminology here: the number that we calculate, the weighted sum, is called the pre-activation. The output of the activation function is called the activation of that neuron.

So what do these activation functions look like? Not the activations, the activation functions. There is a variety of them, and if you look at papers right now you'll see that only a few of these appear. We will talk about some of them in detail, at least. If you look at most of the recent papers, you will see that ReLU, the rectified linear unit, is the most common nonlinearity. Essentially it takes the input, and if the input is positive it passes it through linearly, but if it is negative it just chops that out. We'll see if this is a good idea or not. A leaky ReLU does the same thing, but it leaks some part of the negative input. The exponential linear unit is similar to the ReLU in the positive regime, so essentially it's linear there, and then it has an exponential in the negative region.
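For reference, here is a small NumPy sketch of the activation functions just mentioned. The leaky slope and the ELU alpha are the commonly used default values, assumed here for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)                              # zero for negative inputs, linear otherwise

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)                   # leaks a small fraction of the negative input

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))   # exponential in the negative region

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))                        # squashes to (0, 1)

z = np.linspace(-3, 3, 7)
print(relu(z), leaky_relu(z), elu(z), np.tanh(z), sigmoid(z), sep="\n")
```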
B
The
reason
that
we
use
any
of
them
or
one
of
them
or
the
other
is
mainly
because
of
computational
efficiency
and
also
for
optimization
ease.
Is
it
easier
to
optimize
a
neural
network
with
one
of
these
versus
the
other?
Sometimes
you
want
to
try
in
your
own
network
and
see
that
one
of
them
works
better
for
you
they
can
the
tantor
the
hyperbolic,
hyperbolic
tangent
you
you
see
this
mostly
in
recurrent
models
nowadays,
I
think
you
will
hear
more
about
this
later
this
week.
Sigmoids
and
Contra
are
also
used
in
as
output
layers.
B
B
Hypertension
hyperbolic
goes
from
minus
1
to
1.
This
also
is
a
nice
property
that
you
might
be
looking
for.
Ok,
so
that's
a
that's!
What
a
neural
network
is
now
we
said
that
we
want
to
build
this
mirror
network
to
try
to
approximate
some
relationship
in
the
right,
but
what
sort
of
relationships
can
we
approximate?
B
There
is
a
theorem
that
appeared
in
early
90s.
It's
called
a
universal
approximation.
Theorem
theorem
says
essentially
this
that
if
you
have
a
neural
network
with
one
hidden
layer,
it
can
approximate
any
continuous
function.
That
is
there,
given
that
you
can
have
as
many
neurons
or
hidden
units
in
that
layer
as
possible,
essentially
in
their
neural
network
with
a
linear
output
unit,
can
approximate
can
approximate
any
continuous
function,
arbitrary,
well,
given
enough
hidden
units.
B
So
the
reason
that
this
is
an
important
result
is
that
we
have
a
theoretical
guarantee
that
if
we
have
the
right
architecture-
and
if
you
have
the
right
capacity,
we
will
be
able
to
approximate
that
Easterns
principle
we'll
be
able
to
approximate
the
relationship
that
we
have
in
data
now.
Of
course,
this
theorem
doesn't
mention
anything
about
how
easy
it
is
to
find
the
parameters
of
of
such
a
network
right.
B
So
you
can,
you
can
have
random
parameters,
but
you
don't
necessarily
have
a
method
finding
the
right
parameters
to
approximate
your
function,
and
it
also
doesn't
mention
anything
about
it.
Also
missa
Tate's
here
an
arbitrary
number
of
hidden
units
and
that's
not
practical
right,
so
you
might
not
have
enough
hidden
units
to
actually
represent
the
relationship
that
you
have
okay,
so
we
talked
about
neural
networks
as
essentially
function,
approximator
z--.
I'm sure you have come across this trick before: you build the cost function, right? The basic idea is that you have a certain target. All of this I will present in the supervised learning setup, just for illustration, because it's easier to illustrate here, but it's the same thing in unsupervised learning: you will have some target that you want to achieve. So the basic idea is that you have some loss function. We assume that if the loss function is high, then it's bad; if it's small, that means the output of the neural network is very close to the real target. Then you average all of that over your entire data set. We call that the cost function, so the cost function is essentially the average of the loss over many examples of your training data set.
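Written out, the cost function is just the per-example loss averaged over the training set. The notation here, $f_\theta$ for the network and $L$ for the per-example loss, is assumed for illustration:

$$
J(\theta) \;=\; \frac{1}{N}\sum_{i=1}^{N} L\big(f_\theta(x_i),\, y_i\big)
$$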
B
Okay,
there
is
a
framework
called
empirical
risk
minimization.
So
if
you're
looking
at
any
sort
of
introductory
course
and
machine
learning
or
in
deep
learning,
you
will
see
this
this
framework.
The
basic
idea
is
that
what
we
really
want
to
achieve,
we
don't
want
to
have
our
network
do
very
very
well
on
the
training
data
set,
but
we
are
really
trying
to
do
is
to
have
it
do
well
on
a
data
set
up.
It
hasn't
seen
before
right.
This
is
what
we
call
the
generalization
error.
B
We
want
it
to
actually
generalize
beyond
the
data
set,
that
we
have
the
two
concepts.
So,
if
we're
only
trying
to
make
it
work
on
the
training
data
set,
that
would
be
called
optimization
if
we're
trying
to
make
it
work
on
an
unseen
data
set.
That
would
be
called
learning
right.
That's
the
goal
of
learning,
so
the
real
goal
is,
is
to
actually
have
the
cost
function
on
the
entire
data
set
to
be
really
really
low.
The
entire
data
set.
B
This
is
the
actual
the
data
generation,
Distasio
generation
distribution,
the
original
source
of
your
data
set,
but
we
don't
have
access
to
this
one.
This
would
be
called
the
true
risk,
that's
what
we
are
trying
to
minimize,
but
what
we
end
up
minimizing.
We
end
up
minimizing
the
empirical
risk,
which
is
the
same
quantity,
but
averaged
over
the
training
data
set
that
we
have,
and
we
call
this
the
empirical
risk.
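Side by side, in the same assumed notation as above: the quantity we would like to minimize versus the one we can actually compute.

$$
R_{\text{true}}(\theta) = \mathbb{E}_{(x,y)\sim p_{\text{data}}}\big[L(f_\theta(x), y)\big]
\qquad\text{vs.}\qquad
R_{\text{emp}}(\theta) = \frac{1}{N}\sum_{i=1}^{N} L\big(f_\theta(x_i), y_i\big)
$$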
B
The
reason
that
I
wanted
to
point
this
out
is
because
this
is
generally
the
at
least
a
theoretical
framework
from
where
all
of
this
starts
we're
trying
to
minimize
the
real
risk
on
data
generation
distribution,
but
we
end
up
doing
an
optimization
over
the
training
data
set,
and
then
we
hope
that
it
will
do
well
on
an
unseen
data
set.
Okay.
This
is
a
great
principle,
but
it
doesn't
in
reality,
it's
not
very
it's
good
to
to
think
about,
but
it's
not
how
we
build
these
cost
functions.
B
For
many
reasons,
it
turns
out
that
most
of
the
of
their
losses
that
we're
interested
in
this
empirical
risk
or
the
risk
that
that
we're
interested
in
most
of
the
time
it's
not
smooth,
so
you
can
think
of
the
risk
if
you're
trying
to
classify
cats
and
dogs.
What
you're
really
trying
to
to
say
is
this
image
a
cat
or
a
dog?
So
it's
a
zero
one
sort
of
risk
you
either
it's
either
a
dog
or
a
cat.
There's
no,
like
you
don't
give
me
problems
like
the
real
risk,
is
not
probabilities.
B
B
B
B
B
B
Normally,
that's
an
assumption,
so
you
say
that
the
difference
between
y
and
the
function
and
the
output
of
your
neural
network
I
want
that
to
be
distributed
normally
or
I,
assume
that
the
real
errors
and
the
data
sets
are
distributed
normally,
and
this
is
an
a
good
assumption
right.
It
says
that
if
the
if
the
output
of
the
neural
network
is
very
close
to
the
real
output,
it's
okay,
but
if
it's
very
very
far,
I
want
you
to
penalize
strongly
right.
So.
B
If
I
have
P
model
is
a
normally
normal
distribution,
I
can
plug
that
into
the
log-likelihood
and
then
P.
Remember
the
normal
distribution
is
exponential
to
the
power
of
the
mean
minus
the
F
here,
which
is
the
x
squared
and
then,
when
you
have
the
log,
it
cancels
the
exponential
and
you
end
up
with
there
l2
loss,
and
this
is
essentially
how
you
think
of
this
is
using
the
maximum
likelihood
to
build
the
l2
loss
in
a
similar
fashion.
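Spelled out under that Gaussian assumption (the symbols here are the usual ones, assumed for illustration): the model density is $p(y\mid x) \propto \exp\!\big(-(y - f_\theta(x))^2 / 2\sigma^2\big)$, so the negative log-likelihood drops the exponential and leaves the squared error,

$$
-\log p(y\mid x) \;=\; \frac{\big(y - f_\theta(x)\big)^2}{2\sigma^2} + \text{const}
\;\;\Longrightarrow\;\;
\hat\theta \;=\; \arg\min_\theta \sum_i \big(y_i - f_\theta(x_i)\big)^2 .
$$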
Gradient descent, right? This is the oldest trick in the book. You have a function and you're trying to minimize that function. How do you do that? You take the derivative at the point where you are; the derivative points in the direction where the function is increasing, so its negative will be the descent direction; and then you take one step in the descent direction and you update your parameters. So mathematically, you have your w_k and then you take the gradient.
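In the notation of the slide, with $w_k$ the parameters at step $k$ and $\eta$ the step size, the update is:

$$
w_{k+1} \;=\; w_k \;-\; \eta\, \nabla_{w} J(w_k)
$$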
B
Okay.
You
need
to
remember
that
we're
talking
about
the
learning
great
or
the
step
size
where
this
is
where
it
comes
in.
We
will
talk
a
bit
about
this
later.
In
reality,
this
gradient
gradient
descent.
When
we're
talking
about
just
gradient
descent,
we
mean
take
your
entire
training
data,
set,
evaluate
the
gradient
on
the
entire
training
data
set
and
then
make
one
step
this
is
it
doesn't
work
really
in
reality
right.
Your
dataset
can
be
millions
of
images,
it's
extremely
expensive
to
actually
evaluate
your
last
function
on
the
entire
data
set.
B
Another
thing
is
that
you
don't
want
your
training,
the
complexity
of
your
of
training
or
optimizing.
Your
network
to
grow,
as
your
data
set,
is
growing
right
if
I
am,
if
I'm,
essentially
increasing
that
I'm
evaluating
the
entire
gradient
on
the
entire
data
set.
That
will
be
all
n
complexities,
linear
complexity,
but
you
don't
want
that.
So
in
practice
we
use
the
gradient
that
stochastic
gradient
descent.
B
We
say
instead
of
using
the
full
gradient,
let's
evaluate
the
gradient
approximate
that
or
with
just
a
small
number
of
of
examples
from
the
dataset,
and
we
hope
that
that
is
you
know
it's
good
enough.
It
will
give
me
a
good
idea
of
which
direction
to
go,
but
the
one
I
rely
too
much
on
on
the
gradient.
This
is
what
we
call
stochastic
gradient.
Descent
is
stochastic
because
those
examples
are
presumably
random,
so
you're
picking
them
randomly.
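Here is a minimal sketch of that mini-batch loop, assuming a hypothetical `loss_and_grad(params, x_batch, y_batch)` helper that returns the loss and its gradient with respect to the parameters; the learning rate and batch size are illustrative defaults.

```python
import numpy as np

def sgd(params, X, Y, loss_and_grad, lr=0.1, batch_size=32, epochs=10):
    """Plain stochastic gradient descent over randomly shuffled mini-batches."""
    n = len(X)
    for epoch in range(epochs):
        idx = np.random.permutation(n)                   # shuffle so batches are (roughly) random
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            loss, grad = loss_and_grad(params, X[batch], Y[batch])
            params = params - lr * grad                  # one step against the mini-batch gradient
    return params
```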
B
You
don't
want
to
have
a
lot
of
correlations
in
the
in
the
randomness
of
your
gradient,
so
this
came
in
the
beginning.
It
came
out
as
an
idea
for
how
to
do
this
iterative
process
of
doing
gradient
descent
or
optimization
much
faster
in
practice.
What
we
realized
is
that
the
noise
that
you
get
from
the
stochastic
nature
of
this
gradient
estimate,
essentially
the
difference
between
this
gradient
value
from
small
small
number
of
examples
and
the
full
data
set.
It
turned
out
that
that
noise
in
itself
is
extremely
important
to
optimize
these
neural
networks.
B
We
will
see
and
we'll
see,
a
plot
later
of
how
the
last
function,
the
surface
of
this
last
function,
might
look
like,
so
essentially
that
noise,
at
least
intuitively.
It
helps
to
kick
your
network
or
your
parameters
out
of
local
minimums,
so
that
it
goes
to
a
more
a
global
minima
and
in
fact
it
turns
out
that
the
larger
the
batch
that
we
use
the
more
problems
we
have
in
finding
there
are
a
good
many
minimizer
of
the
entire
network.
B
So
you
will
see
I
think
during
this
week
you
will
see
a
lot
of
discussions
of
large
batch
training.
How
do
I
do
training
with
a
larger
batch
okay,
so
two
things
to
point
out
here
is
that
the
learning
rate
and
the
mini
batch
size.
How
many
examples
do
you
want
to
use
in
every
step?
These
are
hyper
parameters,
and
these
are
examples
of
two
hyper
parameters
that
are
extremely
important
to
find
good
parameters
to
train
your
network
and
then
we'll
talk
more
about
this.
B
In
my
talk
and
then
also
in
other
talks,
there
is
a
hpo
talk
later
in
the
week.
Hyper
parameter,
optimization
talk
that
discusses
just
how
to
do
this.
This
stuff,
generally
right
now
to
first-order
using
a
small
batch
somewhere
between
1
and
32
and
powers
of
2,
is
reasonable.
This
is
what
you
will
see
in
most
in
practice.
Once
we
have
once
the
community
has
experience
with
a
certain
network,
you
start
seeing
larger
and
larger
batch
sizes,
for
example
ResNet.
You
will
see
that
most
of
the
time
people
train
with
256
batch
size.
Yes,.
B
B
Actually
there
is
a
lot
of
research
recently
that
is
doing
just
that,
but
so
essentially
it
says
that
when
I
am
when
I
start
from
random
parameters
at
the
very
early
stage,
I
want
to
have
as
much
noise
in
my
gradient
as
possible,
and
then
I
use
a
small
batch
size
to
make
my
steps
I'm
still
exploring
trying
to
kick
myself
out
of
all
the
local
mini
months,
but
once
I
get
to
a
flat
region
and
the
last
surface
I
can
take
a
more
confident.
There
are
less
problems.
B
B
Okay.
So
how
does
this
look
in
practice?
I
just
want
to
emphasize
that
yeah.
When
you
use
learning
great
that
are
off,
you
will
get
differently
lost,
curves
or
learning
curves,
and
you
will
need
to
really
need
to
find
the
right
learning
rate
okay.
So
how
do
we
find
the
the?
How
do
we
actually
do
this
in
practice?
So
if
you,
if
you
look
at,
if
you
look
at
examples
trying
to
visualize,
there
are
a
lot
of
these
we're
trying
to
visualize
the
last
surface
of
a
real
neural
network
on
a
real
data
set.
B
You
will
see
examples
like
this.
So
I
think
this
is
for
vgg
56,
which
is
one
of
the
winners
of
the
imagenet
competition
standard
model.
People
have
done
a
lot
of
stuff
with
it,
and
this
is
a
visualization
of
the
lost
surface
at
a
certain
point,
during
the
training
the
way
that
they
do
this,
they
try
to
find
two
directions
in
which
the
last
changes
the
most
and
then
try
to
visualize
it,
because
you
know
these
networks
have
tens.
B
If
not
hundreds
of
millions
of
parameters,
you
want
to
choose
two
directions
to
visualize,
to
make
a
surface
like
this.
Unfortunately,
we
can
plot
in
more
than
more
dimensions.
So-
and
you
get
something
like
this,
you
can
immediately
see
that
sort
of
trouble
that
you
can
run
into
right.
You
can
get
stuck
in
a
lot
of
local
minimis
if
you're,
if
you're
learning
great,
doesn't
doesn't
essentially
kick
you
out
of
these
local
minimums.
If
you
don't
have
enough
noise,
you
will
not
get
to
a
location
like
this
right.
B
You
can
also
see
that
you
can
get
stuck
optimizing
like
and
in
certain
places
you
can
get
stuck
in
saddle
points
right
instead,
just
like
you
know
in
your
in
your
pace,
you
can
also
see
immediately
here
that,
if,
if
you're,
for
example,
your
parameters
are
somewhere
on
the
surface
and
then
you're
you're
lost
you're,
your
learning
rate
is
very
small.
You
won't
travel
far
from
where
you
have
started
right,
but
if
your
learning
rate,
it
is
very
large,
it
can
essentially
catapult
you
all
the
way
to
arrive
in
somewhere
far
away.
B
There is a range of optimizers; people don't use just plain gradient descent in practice. The first thing that you can think of is that you can have momentum, right? If you have a ball rolling down a surface, then instead of making your step based only on the local gradient, you can accumulate your speed while you're coming down and then use that to give you a sense of where you should go, the general direction. So stochastic gradient descent with momentum would be the very first thing to do. Nesterov is a variation on this, which is essentially: first update my location based on my velocity, and then evaluate the gradient and update. And there is a range of other things, like AdaGrad and RMSProp; essentially, they try to use the size of the gradient along the way to set the size of the step that you're taking. Once you get into AdaGrad and RMSProp, we start having different learning rates, different update scales, for different parameters. Then you have Adam. Adam is essentially doing something like RMSProp plus momentum, so it combines the two ideas, and then it also tries to eliminate any bias in the estimates of the gradient mean and variance. This is very high-level; if we have time we can get into the details of these different optimizers tomorrow. You can see in a plot like this that some optimizers, for example pure SGD, get stuck there.
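Here is a sketch of the classic momentum update (the velocity accumulates a running sense of the descent direction), and of the Adam update, which additionally keeps a running estimate of the squared gradients and corrects the bias of both estimates. The hyperparameter values are the commonly used defaults, assumed here for illustration.

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """One SGD-with-momentum step: accumulate velocity, then move along it."""
    v = beta * v + grad                           # running accumulation of gradients (the ball's "speed")
    w = w - lr * v
    return w, v

def adam_step(w, m, s, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (t starts at 1): momentum plus RMSProp-style scaling, with bias correction."""
    m = beta1 * m + (1 - beta1) * grad            # first-moment (mean) estimate of the gradient
    s = beta2 * s + (1 - beta2) * grad ** 2       # second-moment (variance) estimate
    m_hat = m / (1 - beta1 ** t)                  # bias corrections for the zero initialization
    s_hat = s / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(s_hat) + eps)
    return w, m, s
```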
B
Of
course,
this
is
kind
of
a
diagram
just
to
show
in
principle
how
this
happens.
If
the
learning
rate
is
different,
it
might
not
get
stuck
right
if
there
is
some
noise
and
the
gradient,
it
also
might
not
get
stuck.
But
this
is
just
to
illustrate
the
idea
right
and
you
can
see
that
other
sort
of
optimizers.
B
B
That's
true,
so,
in
practice,
what
we
have
realized
is
that
a
lot
of
these
optical
parameters,
these
optimizers,
they
make
it
easy
to
optimize
network.
If
you
don't
know
what
parameters
to
use
so,
but
in
practice
the
best
sort
of
generalization
error
comes
when
you
use
as
gd+
momentum,
but
it
takes
a
lot
of
hyper
parameter,
optimization
to
be
able
to
find
the
right
parameters.
B
So
I
think
this
is
also
something
that
Brenda
I
mentioned.
Is
that
when
you
use
something
like
Adam,
you
don't
worry
a
lot
about
the
exact
value
of
your
learning
rate,
but
the
best
value.
At
least
you
can
start
experimenting
with
the
rest
of
your
model
without
having
to
worry
so
much
about
this
being
completely
off.
It's
less
sensitive
to
the
exact
value
of
the
learning
rate.
But
if
you
look
at
most
of
the
the
state-of-the-art
results
like
models
like
resonate,
for
example,
you
see
that
they
actually
use
SVD
plus
Compton.
B
B
The
other
question,
okay,
so
I
said
we
have
a
loss
function.
We
take
the
gradient
of
that
loss,
function
with
a
certain
parameter
and
then
the
parameter
of
the
network,
and
then
we
take
one
step
opposite
to
the
gradient
right.
But
we
have
a
lot
of
parameters
in
these
networks.
How
do
we
get
the
parameters
to?
How
do
we
get
to
the
parameters
inside
the
network
themselves?
Not
at
the
very
last
layer
and
the
output
layer?
B
For
example,
we
use
again
the
oldest
trick
in
the
book,
which
is
the
chain
rule
of
calculus
to
propagate
the
errors
from
the
last
function,
all
the
way
to
the
parameters
that
were
trying
to
update.
So
imagine
this
is
the
output,
Z
and
you're
trying
to
get
you're
trying
to
update
W.
You
need
to
actually
pass
through
the
entire
network.
Partial
C
by
partial
W
would
be
partial,
X,
partial
W,
partial
Y,
partial
X,
partial
Z,
but
partial
Y.
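In symbols, for a chain $W \to X \to Y \to Z$ with cost $C$ (the layer names are taken from the spoken description, with the final loss factor included for completeness), the chain rule reads:

$$
\frac{\partial C}{\partial W}
\;=\;
\frac{\partial C}{\partial Z}\,
\frac{\partial Z}{\partial Y}\,
\frac{\partial Y}{\partial X}\,
\frac{\partial X}{\partial W}.
$$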
B
This
is,
if
you're
taking
a
class
like
CS
231.
You
will
see
that
they
spend
at
least
a
whole
lecture
an
hour,
15
minutes
talking
only
about
back
propagation,
how
you
do
actually
this
in
practice,
there
are
a
lot
of
things
that
you
want
to
to
take
care
of.
Essentially
how
do
you
do
it
efficiently
on
linear
ax,
on
on
modern
ax
aerators,
but
for
the
conceptual
understanding
of
how
things
work?
B
All
you
need
to
remember
is
that
there
is
a
chain
rule
and
your
gradients
are
actually
propagating
to
all
the
other
factors
that
you
have
in
your
network,
and
this
is
very
important
because
imagine
that
one
of
these
is
zero.
Imagine
that,
like
one
of
these
is
zero
or
one
of
them
is
extremely
small.
You
will
not
have
any
gradient
signal
going
back
to
area
layers
right,
which
would
kill
the
update
to
2
W
immediately
can
also
think
of.
If
this
is
extremely
large,
it
will
also
support
the
whole
thing
off.
The ReLU is the most common activation function now in the networks that you see around. The basic form of the function: if the input is positive, you get a linear response; if the input is negative, it's zero. There are a lot of good properties about this. First of all, it's computationally cheap, extremely cheap, right? Compare this to the sigmoid, which has an exponential — we'll talk about this in a bit — and exponentials are very expensive to calculate. Initially, people thought that the sigmoid would be a good way to do it. The other thing is that when the input is positive, the slope of this function does not alter the slope of the actual output of the neuron; it can pass very strong gradient signals through. You see it's not dying like other functions, so the slope here is good for gradient propagation. One of the issues with ReLU is that if the pre-activation of your neuron is negative, it essentially kills the output, but it also kills the gradient, right? We just said that if this has slope 0, nothing will propagate back. So this leads to dead neurons.

One way to get around this is to use something called leaky ReLU, which essentially keeps some part of the negative portion of your input: you output alpha times X, where X would be negative here, with alpha between 0 and 1. So you keep some leakage in your function so that gradients can propagate back, and this is very important in practice. You will see that one of the ways to actually monitor whether your network is doing well or not is to look at how many neurons are not dead.

Okay, two other activations I want to talk about. The first one is the sigmoid. We don't use the sigmoid inside the neural networks, in the hidden layers, anymore; you might still find it in a lot of places, but in practice we use it to represent probabilities.
B
So
if
the
output
of
my
neural
network
has
to
be
somewhere
between
zero
and
one,
it's
very
easy
to
just
take
the
output
of
the
neural
network.
It
should
be
on
the
x-axis
and
then
that
will
give
me
it
will
squash
that
x-axis
into
0
to
1-
and
this
is
great
right
for
representing
Bernoulli
distribution-
it's
expensive
to
you
to
compute.
However,
if
you're
only
using
it
at
the
very
last
layer.
It's
ok
right.
B
One
thing
that
I
want
to
mention
about
sigmoid.
Is
that
actually
they
only
say
so
a
lot
of
what
you
will
see
right
now
and
stuff
that
you
think?
Oh,
this
is
great,
and
this
is
what
happens
to
all
of
us.
You
understand
the
different
details,
but
after
you
get
into
deep
learning
in
practice,
especially
if
you
are
in
application,
applying
deep
yearning
for
science,
not
doing
deep
learning
research,
you
get
into
the
practice
of
doing
neural
networks
as
plug
and
play.
So
you
say:
oh,
this
is
I.
B
Want
this
output
I'm
gonna
try
with
the
output,
which
is
between
0
and
1,
and
that
is
a
sigmoid
function,
gives
me
0
1,
that's
very
nice
and
then
you're
gonna
say:
oh
I'm
gonna
try
different
classes,
I'm
gonna,
try
maximum
likelihood,
I'm
gonna,
try,
l1
I'm
gonna,
try
l2,
and
there
are
fundamental
reasons
for
why
that's
a
bad
idea
to
use
any
random
loss
with
any
round
activation
function.
One
of
them
is
this
one,
so
you
can
see
here
that
the
output
of
a
sigmoid
it
has
extremely
vanishing
like
this
is.
B
B
B
B
B
Yes,
I
see
your
question,
so
the
question
is
release.
They
seem
to
be
linear
because
they
always
output
everything
linearly
right,
except
when
it's
positive,
when
it's
negative
and
the
answer
and
how
come
this
is
a
nonlinear
function,
the
answer
it
is
nonlinear,
because
the
fact
that
you're
actually
killing
the
negative
part
produces
all
of
these
farce
representations
when
you
compose
a
lot
of
them
after
each
other
you're
creating
you
know,
nonlinear
big
nonlinear
functions
and
the
idea
of
riilu
by
the
ways
it's
also
connected
or
inspired
by
nuance.
B
You
have
done
the
writes
and
then
they
either
respond
or
they
don't
respond
some
time
the
whole
neuron
might
respond
or
not.
And
then,
if
it's
just
a
zero
one
sort
of
response
you
still
create,
you
can
create
nonlinear
functions.
Okay,
so
the
point
I
was
trying
to
make
here
if
you're,
using
a
sigmoid.
Remember
that
you
have
an
exponential
in
there.
You
remember
that
you
have
this
vanishing
gradient
and
then
remember
that
that
needs
a
log
to
undo
the
exponential.
B
B
B
So,
in
this
case,
like
Bernoulli,
we
are
assuming
that
the
data
is
the
the
actual
output
is
distributed
according
to
her
knowledge
distribution,
we're
trying
to
match
that
it's
it's
much
better
for
learning,
it's
easier
to
actually
optimize
networks
with
these
distributions,
rather
than
to
try
to
optimize
on
the
original
distribution
that
it's
like
0
1,
for
example,
that
we're
looking
for
I.
Think
it's
it's
easier
to
illustrate
this
with
the
softmax,
because
you
have
a
you,
have
a
so
the
question.
So
ok,
so
we
talked
about
the
sigmoid.
B
It's
great
to
to
represent
binary
sort
of
probability
is
some
number
between
0
and
1,
and
the
question
is
whether
why
is
this?
What
if
this
doesn't
actually
represent
the
data
set
right
in
your
data?
You
have
it's
either
a
dog
or
a
cat.
It
doesn't
have
probabilities
in
between
and
trying
to
say
that
we,
this
idea,
is
connected
to
the
idea
that
we're
using
surrogates
loss
functions
because
they
are
easier
to
actually
handle
in
the
optimization
process.
B
Now,
if
what,
if
we
have
multiple
categories,
not
daksa
cats
and
dogs,
but
we
have
ten
ten
different
categories.
What
do
we
do?
So
we
again
assume
that
there
is
something
the
data
has
some
distribution.
In
this
case.
We
assume
that
there
is
a
multi,
only
output
distribution,
and
this
will
be
multi
class
out,
but
in
reality
the
real
rest
things
like
this,
for
example,
if
you're
looking
at
colors
the
colors
are
color,
is
not
the
best
to
explain
this,
but
categories
of
animals.
B
Your
categories
are
really
like:
it's
either
this
animal
or
that
it's
not
probabilities
over
those.
However,
using
this
idea
that
we
have
a
distribution
over
all
possibilities
makes
it
easier
to
actually
optimize
the
neural
network.
So,
when
you're
looking
at
an
MS
trying
to
identify
the
animal
in
that
image,
you
assume
that
there
is
a
multinomial
distribution
and
they
have
different
probabilities
of
being
different
animals,
and
that
is
not
reading
the
distribution.
B
Does
that
kind
of
answer
your
question?
We
can
talk
about
it
later:
okay,
so
multi,
no
lis
distribution.
The
idea
is,
how
do
I,
if
I,
have
a
multi-class
problem?
What
sort
of
loss
function
do
I
use?
What
sort
of
how
do
I
guarantee
that
the
output
of
my
network
can
actually
give
me
something
that
tells
me
which,
to
which
class
does
the
object?
Does
the
input
belong
and
the
idea
is
to
use
the
softmax
function?
B
The
softmax
function
says
essentially
exponentiate
all
of
your
the
output
of
all
the
layers,
so
all
of
these
layers
exponentially
at
their
output
and
then
normalize
them
by
the
total
right,
so
that
and
that
would
be
the
output
for
every
category,
and
this
is
essentially
gives
you.
The
sum
of
this
over
all
categories
would
be
one
because
that's
the
denominator
right
and
then
the
value
of
each
one
of
these
would
be
the
probability
that
the
input
belongs
to
any
of
the
objects
that
you
have,
so
it
does
produce
a
distribution
of
our
classes.
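A minimal, numerically stable softmax sketch (subtracting the max is a standard trick, assumed here, that does not change the result since the softmax is shift-invariant):

```python
import numpy as np

def softmax(logits):
    """Turn a vector of pre-activations into a probability distribution over classes."""
    z = logits - np.max(logits)    # for numerical stability; does not change the output
    e = np.exp(z)
    return e / e.sum()             # normalize by the total so the probabilities sum to 1

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())                  # e.g. [0.659 0.242 0.099], sums to 1
print(p.argmax())                  # the class with the largest probability is the network's guess
```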
B
This
is
different
from
trying
to
tell
if
two
different
objects
are
within
the
within
an
image
or
not
like,
for
example,
if
there
is
a
probability
that
there
is
a
dog
and
there
is
a
cat.
In
that
case,
you
want
to
use
a
soft
max.
You'd
use
a
sigmoid
over
two
different
neuron
outputs,
okay.
So
the
way
that
you,
you
use
the
information
that
comes
out.
You
say
that
the
class
with
the
largest
probability,
that
would
be
the
class
of
the
object.
So
that
would
be
the
guess
of
the
network.
B
The
if,
if
the
data
is
noisy,
that
your
labels
themselves
would
be
noisy,
so
the
neural
network
would
try
to
do
whatever
is
the
target
that
you
have
in
your
own
data
set?
If
and
that's
a
question
like
that,
comes
back
to
you
that
can
you
tolerate
that
noise
in
your
data
set
or
not,
and
if
you
can
tolerate
it,
how
do
you
want
to
handle
it?
If
not,
you
want
to
remove
it,
and
this
goes
back
to
yeah.
So
I
think.
Does
that
answer
your
question?
B
Instead
of
using
regression,
we
classify
our
output,
we
know
that
the
output
has
to
have
a
range
between
minus
ten
and
ten.
We
divide
that
into
different
categories,
and
we
try
to
use
a
soft
max
to
predict
a
distribution
over
over
these
different
bins,
rather
than
try
to
predict
a
particular
number,
and
this
is
for
many
reasons.
B
The
other
thing
is
that
the
when
you're
using
you're
doing
regression
and
the
output
of
your
neural
network
is
linear,
you're,
really
trying
to
you're
asking
the
network
to
output
that
particular
number,
but
when
you're,
when
your
output
is
exponentiate
like
this
all
you're
asking
the
neural
network
is
to
guess
some
good
number
right,
because
because
the
number
really,
if
it's,
if
it's
maybe
three
or
ten,
it
will
probably
give
me
the
same
category.
It
will
give
me
that
in
the
same
range,
so
the
neural
network
has
much
bigger
range
to
play
with
to
output.
B
B
So
if,
if
your
data
is
wrong,
if
it's,
if
your
label
says
it's
green
and
it's
not
green,
then
it
will
just
memorize
that
and
your
learning
will
be
bad.
It's
you're
not
going
to
learn
something
that
generalizes
well,
if
you
have,
if
you,
if
the
data
is
just
noisy,
there's
a
distribution
over.
Maybe
this
is
like
you're
showing
a
certain
color.
B
B
These
are
two
cases
but
I'm
sure
there
are
so
many
other
cases
in
between
that.
You
need
to
handle
them
one
by
one.
Okay,
so
I
want
to
get
into
I.
Think
I
have
20
minutes.
Okay
have
20
minutes
ago,
so
I'll
get
into
a
few
topics
that
are
important
before
we
carry
on
with
the
rest
of
the
day.
First
of
all,
Brenda
talked
about.
B
How
do
you
monitor
your
neural
networks,
learning
and
I'm
sure
this
idea
that
you
have
also
seen
it
before
right
again,
what
we're
trying
to
do
we're
not
trying
to
optimize
we're
trying
to
learn
so
we're
trying
to
generalize
beyond
the
data
set
that
we
have.
So
if
you
look
at
the
error
of
your
front
viewer,
your
neural
network
output
or
the
last
function
or
any
surrogate
for
this
quantity,
and
then
you
evaluate
that
on
your
training
data
set,
it
should
be
going
down
all
the
time.
B
So
optimization
has
happened
right,
however,
we're
trying
to
generalize
to
some
other
data
set.
So
we
usually
split
the
original
data
into
three
categories:
a
training
data,
a
validation
data
and
a
test
data
set.
The
test
data
set.
You
just
hide
until
the
very
last
day
before
you
submit
your
papers,
so
very
last
night,
don't
even
bother
trying
to
do
anything
on
the
test
data
set.
This
is
extremely
precious.
You
really
don't
want
to
do
it.
B
You
don't
want
to
look
at
it
at
all
until
the
very
last
minute,
the
validation
data
set
I
prefer
a
different
name
for
it's
called
the
development
data
set.
Sometimes
you
even
spread
that
validation
into
two
data
sets.
One
of
them
is
called
development.
The
other
one
is
validation.
The
development
data
set
is
something
that
you
use
to
tune
your
hyper
parameters.
The
the
validation
data
set
is
to
monitor
the
learning
process
tuning
the
hyper
parameters.
You
don't
want
to
over
tune
them
to.
B
Actually
you
know
your
overfitting
to
the
actual
validation
data
set
that
you
have
that's.
Why?
Sometimes
you
you
keep
a
development
data
set
aside.
The
breakdown
of
this,
the
general
consensus
is
80/10/10.
You
will
see
a
lot
of
60
20,
20
or
any
other
variation
I
prefer
to
have
as
large
of
validation
and
test
data
sets
as
possible,
because
you
really
don't
want
to
fool
yourself
right.
Like
you
don't
want
to
come
out,
oh
it's
great
you're.
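A minimal sketch of an 80/10/10 split as just described; the shuffle seed and the exact proportions are illustrative assumptions.

```python
import numpy as np

def split_dataset(X, Y, frac_val=0.1, frac_test=0.1, seed=0):
    """Shuffle once, then carve the data into train / validation (dev) / test sets."""
    n = len(X)
    idx = np.random.default_rng(seed).permutation(n)
    n_test = int(n * frac_test)
    n_val = int(n * frac_val)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]                  # remaining ~80% is the training set
    return (X[train], Y[train]), (X[val], Y[val]), (X[test], Y[test])
```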
B
If
you're
really
trying
to
do
science,
we
want
to
have
confidence
that
you're
really
doing
the
right
thing
or
your
network
is
doing
the
right
thing
now.
How
do
these
these
curves
look
like?
There
are
multiple
regimes:
if
you're,
comparing
to
the
training
data
set
to
you,
there
is
the
and
the
validation
there's
the
regime
where
you're,
essentially
your
model
is
underfitting.
Underfitting
means
that
you're
not
even
doing
well
on
the
training
data
set
itself,
and
that
has
there
are
many
things
that
you
need
to
check.
B
We
will
get
into
some
of
these
tomorrow,
but
the
very
first
thing
you
are:
you
want
to
see
if
you
have
a
bug
in
your
net
in
your
code
before
any
of
these,
if
you're
underfitting
after
that,
you
want
to
probably
check
your
model
architecture
and
stuff.
Actually
before
that,
you,
you
want
to
check
your
learning
rates
before
the
model
architecture,
because,
if
your
learning
rate
imagine
if
it's
extremely
small,
that
means
you're
not
really
getting
you're,
not
taking
large
steps.
B
You
know
that
are
large
enough
to
get
to
a
minimum
and
it
just
gonna
take
forever
to
get
to
to
a
meaningful
minimum.
So
you
wanna
check
the
learning
rate
after
that
you
probably
want
to
look
at
the
model
architecture.
If
this
check
out,
you
probably
want
to
either
train
longer
or
look
at
either
other
hyper
parameters.
These
are
of
no
particular
order
at
the
moment,
but
tricking
the
learning
rate
would
be
the
very
first
thing.
I
want
to
emphasize
that
the
validation
error
and
the
training
error
being
close
to
each
other.
B
All
the
time
is
that
you
are
underfitting,
so
even
if
your
you
have
trained
for
three
days
and
they
are
still
extremely
close
to
each
other,
you're
still
underfitting
very
likely
you're
still
underfitting,
you
need
to
check
what's
going
on
especially
check
the
learning
rate.
The
idea
is
that
if
I
am
NOT
underfitting,
if
I'm,
not
under
fitting
the
training
glass
should
be,
I
should
be
do
much
better
on
my
training
data
set
than
on
my
validation
visa,
because
these
neural
networks
they're
over
optimum
parametrized,
they
are
extremely
powerful.
B
B
So
the
overfitting
regime
I
do
very
well
on
my
training
data
set
and
not
that
well
on
the
validation
that
kind
of
comes
down
and
then
start
climbing
as
soon
as
starts
climbing
I
know
that
I
started
overfitting.
This
is
what
we
call
the
generalization
gap,
and
this
is
what
you
spend
most
of
your
time,
trying
to
close
this
generalization
gap,
we're
trying
to
move
essentially
this
point
to
much
further
along.
B
B
Some
of
them
is
essentially
the
learning
rate
where
there
could
be
problems
with
the
learning
rate.
The
data
set
could
be
not
enough
right.
If
you
don't
have
enough
data
set,
you'll
probably
want
to
increase
that
there
are
ways
if
you
can't
go
and
collect
more
data.
There
are
ways
to
augment
the
data
set
that
you
have.
You
can
do
that
and
then,
once
all
of
these
things
check
out
you're
welcome
to
deep
learning,
you
start
doing
regularization.
So
regularization
techniques
are
the
stuff
that
we
will
talk
about
tomorrow.
B
If
your
it
depends
on
the
problem,
really
it
could
be
that
your
model
is
not
learning
really
as
fast
enough,
so
you
probably
have
connections
that
have
problems
on
them.
You
have
bottleneck
layers
that
are
too
narrow
for
your
model
to
learn,
but
in
the
general
sense,
yes,
you
want
to
check
the
number
of
layers
that
you
have
the
sizes
of
each
of
the
layers
and
all
of
that,
if
you're
not
overfitting
to
the
training
data
set,
there
is
something
terribly
wrong.
We'll
talk
about
this
under
fitting.
B
A
B
Point
this
is
what
we
call
the
early
stopping
point,
which
is.
Essentially
you
want
to
stop
the
training
as
soon
as
your
validation
error
starts
climbing
and
it's
a
sort
of
a
regularization
technique.
We'll
talk
about
this
okay,
so
I
have
15
more
minutes
and
I
want
to
get
into
the
last
topic
for
today,
which
is
convolutional
neural
networks.
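A sketch of early stopping as it is commonly implemented: keep the parameters from the epoch with the best validation error and stop once the error has not improved for some patience window. The patience value and the `train_one_epoch`, `validation_error` and `state_copy` helpers are hypothetical placeholders, assumed for illustration.

```python
def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=200, patience=10):
    """Stop once the validation error has not improved for `patience` epochs."""
    best_err, best_state, since_best = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        err = validation_error(model)
        if err < best_err:
            best_err, best_state, since_best = err, model.state_copy(), 0
        else:
            since_best += 1
            if since_best >= patience:   # validation error has been climbing or flat for too long
                break
    return best_state, best_err
```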
B
So
you
can
build
all
of
all
sorts
of
functions,
so
we
saw
a
set
up
where
we
have
these
fully
connected,
dense
layer
functions,
they
take
the
entire
input
and
then
try
to
give
an
output.
We
have
a
lot
of
them
stacked
after
each
other.
We
talked
about
the
universal
approximation
theorem.
We
said
that
this
particular
function
can
approximate.
If,
given
enough
capacity,
it
can
approximate
any
continuous
function
that
there
is.
B
However,
this
far
that
particular
set
up
with
the
dense
layers
doesn't
assume
and
think
about
our
dataset
right,
but
our
dates
that
we
know
that
there
are
certain
things
and
that
data
said
that
we
know
that
they're.
Absolutely
true.
For
example,
if
I
have,
if
I
have
objects
in
my
data
set,
I
know
if
those
objects
are
in
the
upper
left,
right,
upper
left
or
upper
right
or
anywhere
they
appear
in
the
data
set,
they
should
be
the
same
same
cat,
for
example
right.
B
These
sort
of
information
or
knowledge
about
the
data
set.
This
is
what
we
call
a
prior
knowledge
and
then
a
lot
of
the
work
that
goes
into
building
neural
networks
into
how
do
I
incorporate
those
prior
knowledge
or
these
constraints
or
knowledge
about
the
data
set
in
the
actual
architecture.
That
saves
me
a
lot
of
things.
First
of
all,
I'm
not
trying
to
solve
an
extremely
general
problem.
B
I'm
not
trying
to
you
know
you
know,
kill
a
fly
with
a
camera,
a
Canon
or
something
right
like
I'm,
actually
using
the
right
tool
for
the
right
job,
and
this
is
not
only
for
we'll
talk
about.
How
is
this
is
achieved
in
currency
announce,
but
this
is
in
in
science
is
actually
more
its.
We
see
that
more
often
than
in
other
other
areas.
You
can
imagine,
for
example,
Sciences
background
or
in
physics
we
always
talk
about
rotation
groups
right.
You
know
that
all
of
our
objects
are
rotationally
invariant
the
molecule.
B
However,
it
looks
like
butter
their
protein
structure.
However,
it
looks
like
it
should
stay,
it's
the
same
object
or
the
same
protein
right.
So
how
do
you
build
a
neural
network
that
respects
all
of
that?
Those
in
variances
when
we
incorporate
those
in
variances
as
infinite
priors
in
the
architecture
that
we
have?
We
tend
to
learn
first,
essentially,
models
that
general
lines
much
better
they're,
not
going
to
respect
that
on
your
training
data
set
and
then
disrespect
or
violated
on
the
validation
data
set.
B
Okay,
so
all
of
that
intro
set.
We
will
see
how
cnn's
achieve
essentially
that,
so
this
is
fully
connected
networks
that
we
talked
about
earlier
today.
You
have
your
input
is
in
blue
and
every
neuron
is
connected
to
every
single
one
of
these
inputs,
and
that's
why
we
call
it
a
fully
connected
Network.
B
One
thing
that
we
we
can
do
is
look
at.
We
do
sparse
connectivity
right
if
I
am
looking
only
if
I
am
thinking
that
only
the
local
sort
of
information
is
important
and
I
don't
have
to
correlate
pixels
or
inputs
features
that
are
far
away
from
each
other.
I
can,
just
you
know,
have
a
local
connectivity.
I
can
have
more
neurons
each
one
of
those
neurons
is
only
locally
connected
to
a
few
input
features.
I
can
do
something
further
than
that.
B
If
I
know
that
the
stuff
that
I'm
looking
for
are
the
same
wherever
they
are,
it
doesn't
really
matter
where
they
are.
In
the
end,
the
input
I
can
share
or
I
can
reuse
the
same
parameters
for
all
the
detectors
right
instead
of
having
different
parameters
here,
I
can
have
the
same
parameters
everywhere,
and
this
is
what
we
call
parameter
sharing.
So
what
does
this
bias
essentially
bias?
The
idea
of
translation
invariance
right
now,
I
have
a
instead
of
having
something
that
takes
the
entire
input
and
tries
to
output
the
gigantic
output.
B
I
have
a
much
smaller
kernel
or
small
parameters
or
a
small
detector
that
is
sliding
over
the
input
and
trying
to
produce
and
tell
me
in
this
patch
what
response
should
I
have
to
dispassion
this
patch
on
this
patch
and
I.
Do
a
lot
of
parameter,
saving
right,
any
less
parameters,
and
this
is
the
idea
of
convolutions
desire,
communes
work.
B
Essentially,
you
have
a
bunch
of
parameters,
you
slide
over
your
input
and
then
you
do
a
dot
product
with
of
your
weights,
with
your
input
right
the
bias,
and
then
you
move
on
to
the
next
two.
You
slide
the
kernel
over
and
you
move
on
to
the
next
thing
again.
What
does
this
achieve?
We
have
sparse
connectivity,
it's
only
local
responses.
We
have
parameter
sharing.
We
have
way
less
number
of
parameters,
we'll
see
an
example
in
a
bit,
and
we
also
have
translation
in
covariance.
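Here is a naive sketch of that sliding dot product for a single 2-D channel (no padding, stride 1); real frameworks do this far more efficiently, but the arithmetic is the same.

```python
import numpy as np

def conv2d(image, kernel, bias=0.0):
    """Slide `kernel` over `image`, taking a dot product (plus bias) at every position."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))            # "valid" output size, stride 1
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kH, j:j + kW]
            out[i, j] = np.sum(patch * kernel) + bias   # the same weights are reused at every location
    return out

print(conv2d(np.arange(25.0).reshape(5, 5), np.ones((3, 3))).shape)   # (3, 3) feature map
```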
B
You
can
check
what
the
difference,
equivalence
and
in
variances
later,
but
generally,
the
basic
idea
is:
if
I
have
an
object
in
mind.
In
my
input
it
doesn't
matter
where
that
object
is
I,
don't
need
to
learn
different
parameters
in
different
places.
This
is
again,
this
is
an
infinite
prior
on
what
sort
of
data
that
I
am
looking
at
I'm,
saying
that
my
data
has
that
property
of
being
having
objects
being
translationally
invariant
few
terminologies.
B
This
is
my
input
matrix.
We
call
it
an
input
matrix.
This
is
a
convolution
kernel
or
filter,
so
you
will
see
that
people
if
they
say
a
kernel,
that
means
it's
just
a
bunch
of
weights
that
you
multiply
by
the
input
they
can.
It's
also
called
a
filter,
so
you'll
see
that
everywhere
and
then
the
output
is
most
of
the
time
people
say
a
feature
map.
Sometimes
people
say
it's
an
activation
map,
an
example
of
how
this
works.
In
reality,
you
have
an
image.
The
image
is
52
by
32
by
3.
B
The
3
is
the
number
of
channels,
red,
green,
blue,
and
then
you
have
a
filter.
The
filter
tries,
for
example,
a
filter
5
by
5
by
3,
and
this
slides
over
the
input.
The
your
filter
has
to
have
the
same
number
like
3
to
match
the
channels
right.
So
essentially,
you
will
have
3
sets
of
5
by
5
the
parameters
to
process
your
input
and
then,
if
you
do
the
math
you'll,
you
will
see
that
your
output
is
28
by
28
activation
map
and
it
has
one
channel
here.
B
You
can
immediately
see
with
these
numbers
that,
if
I
have
just
to
ingest
this
input,
if
I
was
using
a
fully
connected
network
that
will
have
at
least
32
by
32
by
3,
that's
like
3000
or
something
and
this
and
then
multiplied
by
the
number
of
outputs
that
I
want
to
have,
but
at
least
for
one
neuron
that
would
be
about
3000
parameters
and
when
I'm
using
a
convolutional
kernel.
This
is
5
by
5
by
3.
This
is
75
parameters.
There
is
almost
two
orders
of
magnitude
reduction
in
the
number
of
parameters.
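The parameter counting from the example, written out as a sketch (the fully connected count is per output neuron, and biases are ignored):

```python
# Fully connected: every output neuron sees the whole 32 x 32 x 3 input.
fc_params_per_neuron = 32 * 32 * 3           # = 3072 weights for a single neuron

# Convolutional: one 5 x 5 x 3 filter is shared across all spatial positions.
conv_params_per_filter = 5 * 5 * 3           # = 75 weights for a single filter

print(fc_params_per_neuron, conv_params_per_filter)       # 3072 75
print(fc_params_per_neuron / conv_params_per_filter)      # ~41x fewer weights per detector

# Output spatial size with no padding and stride 1: 32 - 5 + 1 = 28, i.e. a 28 x 28 feature map.
```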
So those are the general basic things about convolutions that you need to know. If you look at convolutional neural networks, you will see that there is another type of layer that we use. The essential idea is that I have a lot of these activations coming out as I come down the pipeline of convolutions, and sometimes we want to reduce the size of these feature maps that are coming out. It's like saying: okay, you are outputting 28 by 28, but I want you to summarize that into 14 by 14. To do this we use what we call pooling layers. This is an example of a pooling layer with a kernel size of two: essentially I'm looking at a 2 by 2 patch and deciding how to summarize that 2 by 2 patch into a single number. I can do average pooling, where I average all of these numbers into one number, or I can do max pooling, where I just output the maximum number in this 2 by 2 patch. What does this achieve?
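A sketch of 2x2 max and average pooling with stride 2, assuming the feature map's height and width are divisible by 2:

```python
import numpy as np

def pool2d(x, mode="max"):
    """Summarize each non-overlapping 2x2 patch of a 2-D feature map into one number."""
    H, W = x.shape
    patches = x.reshape(H // 2, 2, W // 2, 2)    # group the map into 2x2 blocks
    if mode == "max":
        return patches.max(axis=(1, 3))          # max pooling
    return patches.mean(axis=(1, 3))             # average pooling

fm = np.arange(16.0).reshape(4, 4)
print(pool2d(fm, "max"))     # the 4x4 feature map is summarized into 2x2
print(pool2d(fm, "avg"))
```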
Why is this useful? First of all, it reduces the size of these activations, and this is very useful for all sorts of computational and optimization needs; you can think of it as: it is probably easier to optimize the network if I start getting smaller and smaller summary sorts of features. The other thing is that it gives some sort of local invariance to small variations. If my maximum is here, or here, or here, it doesn't really matter; I just want you to get it approximately. So having your entire network not be extremely sensitive to local variations is another sort of infinitely strong prior; we're just saying that my data doesn't really depend on the exact location of that pixel, it depends on the global picture.

As for the trend: people usually use max pooling or average pooling, but both of them make an assumption about how I want to summarize my feature maps into something smaller. That's why people started using strided convolutions. The basic idea was: let the network learn, however it wants, how to summarize the information. So you will see that there is a strand of neural networks called all-convolutional neural networks, where there are no pooling layers whatsoever, and all of that summarization happens using strided convolutions.

Then you can put all of that together and build an extremely large neural network. This is a convolutional neural network. There is an input — okay, I think this pointer doesn't work anymore — so this is an input, it's 24 by 24 by 3. There are two convolutional layers and then max pooling; this gets us to another set of feature maps. You see the trend is that usually they have smaller spatial dimensions and then more depth in the output, more filters. The way, at least intuitively, that you think about this is that the depth, the number of filters, is like the number of feature coordinates: red-green-blue would be a three-coordinate input, and at the output of the first layer I want you to summarize your input into something along the depth dimension, which is the number of filters. This is sort of vague; we can talk about it a bit later.

One thing I wanted to mention before we finish is: what do these networks learn?
B
There
are
all
sorts
of
ways
to
try
to
visualize
and
understand
what
these
networks
are
learning.
This
is
an
example
of
a
very
earth.
This
is
actually
this
is
Alex,
and
this
is
the
very
first
neural
network
that
has
won
the
imagenet
competition.
If
you
go
and
you
try
to
visualize
what
would
be
the
input
that
would
maximize
the
activation
of
different
different
neurons
at
different
layers,
you
would
see
that
they
have
actually
distinct,
distinct
features.
B
All
the
networks
that
we
have
seen
is
that
the
very
early
layers
they
tend
to
learn
simple
features,
simple
motifs,
like
edges
and
blobs
and
stuff
to
start
learning
texture,
and
then
they
start
composing
more
and
more
abstract
sort
of
filters
or
templates
by
as
the
depth
goes
on
by
the
end
of
the
layer,
and
this
is
this
is
actually
a
very
important
result.
We
will
talk
about
how
this
is
important
for
transfer
learning
tomorrow,
but
it
also
it's
it's
important
from
the
prior
knowledge
idea
that
we
were
talking
about
earlier.
B
We
said
that
we
try
to
incorporate
as
much
prior
knowledge
in
my
our
networks
as
possible.
One
prior
knowledge
is
that
the
world
that
we
live
in
is
compositional
right
things.
Bigger
things
are
composed
of
simpler
things.
You
don't
have
to
understand
how
a
dog
is
made.
You
just
have
to
understand,
like
all
the
little
edges
that
when
you
combine
them
together,
they
make
a
picture
of
a
dog.
This
idea
of
that
the
world
we
live
in
is
compositional
is
extremely
important.
We
might
talk
about
it
tomorrow.
B
Yes,
that's
true,
so
I
think
you
touched
on
so
many
points.
It
is
true
that
most
of
the
time,
neural
and
CNN's
tend
to
work
much
better
than
any
other
architecture,
and
it
could
very
well
be
because
we
know
how
to
optimize
CNN's.
We
don't
know
how
to
optimize
any
other
architecture.
Humans
have
evolved
to
optimize
their
hands
over
the
past
five
years.
B
B
Okay,
so
visualization
understanding
what
these
neural
networks
work.
There
are
three
excellent
articles
by
Chris
Ola
on
distill
the
table.
They
are
amazing.
They
have
a
lot
of
interactive
sort
of
diagrams
and
features
and
stuff
that
they're
really
amusing
to
look
at
so.
Finally,
you
just
wanted
to
again
I
hope
that
this
has
been
Illustrated
the
samples
the
what
we
are
looking
at
are
as
a
family
of
parametric,
nonlinear
and
hierarchical
representation
learning
functions.
They
try
to
learn
representations
from
the
data
we
optimize
them
using
stochastic
gradient
descent.
B
We
you
will
be
taking
hands-on
classes
today,
or
at
least
one
class
and
Josh
will
talk
about
tensor
flow
ecosystem
as
a
framework
I
want
to
give
you
two
practical
tips
before
tomorrow.
We're
going
to
talk
about
more
things
tomorrow.
First
thing
is
that
how
do
you
debug,
ignorant
network
is
very,
very
tricky
and
it
takes
a
long
time
and
a
lot
of
a
lot
of
experience
to
the
point
that
there's
one
person
who
is
very
famous
in
the
deep
learning
community
when
he
got
his
last
promotion.
B
That
Google
has
reported
for
two
years
of
work
was
I
I,
don't
only
propose
neural
networks,
I
know
how
to
make
them
work,
and
this
is
I
know
that
it
sounds
funny,
but
it's,
and
that
was
the
entire
report,
and
he
got
this
promotion.
This
idea
of
being
able
to
actually
optimize
your
network
and
make
them
work
is
not
is
not
a
simple
business
and
we'll
talk
about
a
lot
of
this
tomorrow.
B
You
will
attend
a
lot
of
these
talks.
You
will
see
a
lot
of
reports
and
blog
posts
online,
but
that
is
not
going
to
give
to
actually
replace
actual
hands-on
experience.
So
two
tips,
the
first
one
is
that
try
to
at
least
I
know
that
it
looks
like
neural
networks
of
black
boxes,
but
there
are
things
that
I
know
that
they
should
be
able
to
do
or
things
that
are
predictable.
B
First
thing
is
the
value
of
the
loss
at
the
very,
very
first
step,
so
if
I'm
classifying
over
10
different
objects,
I
know
that
the
output
should
be
random
uniformly
distributed
over
the
10
different
objects.
If
it's
not,
then
there
is
a
problem
right.
So
the
minus
log
1
over
10,
which
is
around
improbability,
should
give
you
2.3.
You
can
go
and
actually
do
that.
This
is
encourage.
You'll
see
how
to
do
that
later
today.
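That first sanity check, written out (assuming a softmax cross-entropy loss over 10 classes):

```python
import math

num_classes = 10
expected_initial_loss = -math.log(1.0 / num_classes)   # loss when the network guesses uniformly
print(expected_initial_loss)                            # ~2.302585
```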
B
This
is
I,
don't
have
my
pointer
anymore,
but
this
is
essentially
I'm
doing
a
fit
over
a
single
batch
of
data
0
to
32,
and
then
the
batch
size
is
32,
so
try
to
fit
a
single
batch
of
data
and
then
see
what
your
output
is.
You
should
be
able
to
expect
this
number.
This
is
you'll,
be
surprised
how
often
this
number
is
often
because
there
is
something
wrong.
The
other
is
that
you
need
to
be
able
again.
We
said
these
are
extremely
large
neural
networks,
they're
over
parameterize,
they're,
very
powerful.
B
They
can
learn
anything
in
the
data.
There
are
actual
results
paper
results
showing
that
they
can
memorize
entire
noise
data
just
generate
random
sample
of
complete
noise
data
with
millions
of
images
they
will
be
able
to
memorize
them.
100%
recall
accuracy.
They
should
be
able
to
memorize
a
small
part
of
your
data
set.
If
they
can't,
then
there
is
a
problem.
You
don't
need
to
try
to
do
anything
else,
so
the
very
first
thing
you
need
to
do
is
train
on
a
single
batch.
B
Just
take
one
batch
train
for
whatever
number
of
epochs
here,
I'm
training
for
1000
epochs
and
then
see
observe
that
your
loss
function
is
dropping
and
it's
actually
dropping
very
fast
and
you
get
100%
accuracy
extremely
quickly.
If
you
can't
get
the
past
this
point,
don't
even
try
to
do
anything
else.
There
is
something
wrong
either
your
initialization
is
wrong.
Your
learning
rate
is
wrong.
You're,
not
really
training
whatever
it
is
so
and
you
again
you'll
be
surprised.
How
often
this
you
know
saves
you
hours
and
of
work
with.
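A sketch of that second sanity check in Keras style, since the hands-on session uses the TensorFlow ecosystem; the `model`, `x_train` and `y_train` names are assumed placeholders for whatever you build in the session.

```python
# Sanity check: the network should be able to memorize a single batch.
# Assumes `model` was compiled with a loss and metrics=["accuracy"].
history = model.fit(x_train[:32], y_train[:32],      # one batch of 32 examples
                    batch_size=32, epochs=1000, verbose=0)

# The training loss should collapse very fast and accuracy should reach 100% quickly.
print(history.history["loss"][-1], history.history["accuracy"][-1])
```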
B
That
said,
there's
more
of
these
steps
in
the
upcoming
lectures
and
think
we
took
any
a
lot
of
questions.
You
have
any
other
questions.
The
thing
is
in
principle.
No,
so
in
principle,
if
we
have
the
perfect
optimizer
of
optimization
algorithm,
we
will
be
able
the
fully
connected
network
will
be
able
to
learn
the
same
sort
of
function
that
you
learn
with
a
convolution
your
network.
But
even
if
you
have
that
perfect
optimizer,
why
would
you
want
to
do
that?
B
If
you
know
that
that
sort
of
prior
about
your
data
is
there
and
it's
true,
why
do
you
want
to
waste
computational
resources?
The
second
thing
is
that
data
efficiency
it's
much
easier
to
optimize
when
you
have
these
priors
incorporated
in
the
architecture
than
not.
The
third
thing
is,
which
is
the
very
first
one.
We
don't
have
such
an
optimizer.
We
don't
know
we
use
gradient
descent
and
you're
unlikely
to
be
able
to
find
the
function
that
you
get
to
the
convolution
neural
net
or
Crom
a
fully
connected
Network.
B
So
the
question
is
for
how
about
rotational
invariance.
We
talked
about
the
translation,
invariance,
that's
a
great
question
and
I
think
a
lot
of
people
have
up
or
are
working
on
this.
You
probably
am
not
sure
if
you
familiar
with
a
capsule
net.
That
was
one
attempt
to
try
to
do
this,
which
is
essentially
to
have
a
capsule
that
at
least
locally
respects
rotation
invariants.
B
My
understanding
right
now
is
that
turned
out
that
it's
not
very
easy
to
optimize.
It's
not
as
easy
to
optimize
as
the
ones
that
we
have
right
now,
a
simple
way
of
trying
to
get
rotation
invariance,
which
is
not
really
rotation.
Local
rotation
variance,
is
to
do
data
augmentation
and
force,
essentially
rotate.
Your
entire
data
sets
on
the
fly.
It
doesn't
matter
if
the
cat
is
upright
or
or
tilted
or
whatever
rotate
it
on
the
fly
and
enforce
that
by
changing
your
data
set
itself.
B
So
that's
how
we
do
it
in
practice,
there
is
yeah,
there's
also
a
slew
of
other
architectures
that
try
to
get
translation
invariance
in
or
at
rotation
invariance.
In
physical
sciences,
we
have
tests
met
and
a
few
other
people
here
at
the
lab
are
working
on.
Essentially
how
to
incorporate
certain
group
symmetries
not
only
like
simple
so3
would
be
one
of
them,
which
is
just
the
rotations
along.
But
how
do
you
do
like
other
stuff,
like
rotation
and
translation
and
stuff?