From YouTube: 07 - Practicalities in Deep Learning - Joel Hestness
Description
Deep Learning for Science School 2019 - Lawrence Berkeley National Lab
Agenda and talk slides are available at: https://dl4sci-school.lbl.gov/agenda
So up next we've got Joel Hestness, who is going to talk about some practical things. Joel is a senior research scientist at Cerebras Systems, a startup that does specialized AI hardware, pretty cool stuff; I'm not sure how much we can say publicly, but it's cool stuff. At Cerebras he helps formulate strategy to support practitioners and users on the hardware. He also leads some natural language research; you can ask him about that later. I don't think he's going to cover anything related to that, but maybe he will. Before that he worked at Baidu at the Silicon Valley AI Lab on scaling deep learning; we're going to be talking about that later this week as well, very interesting stuff. He has a PhD from UW-Madison and generally has broad experience with a lot of things relevant to us: computing applications, numerical methods, graph analytics, and machine learning. So let's thank Joel for coming today.
So, as Stephen said, I recently moved from Baidu Research to Cerebras Systems. I have a background in heterogeneous system design and optimization, and part of my past life was in performance optimization for large-scale applications. I've been doing deep learning research over the last three years or so. I want to set up the objective of this lecture and hopefully get some nods if this is interesting to people, so I can get a sense for what I should spend the most time on.
So hopefully that's what everybody's here for. Like I said, I'm going to frame this in the context of a research study that we did, so that you can get some ideas about learning algorithm dynamics and build up some intuition, so that when you sit down and have to start training models you have ideas for directions to go. I'm also going to be highlighting some vocabulary throughout the talk that will be useful for you to refer to.
There are some practicalities that I've noted that I think are valuable to people coming into this, just in training, the first things to note when you're trying to do deep learning training. I'll go through those, and then what I'd really like to get at, kind of the meat of this talk, is that we are going to have datasets.
Hopefully the datasets are going to be telling us things, and we want to design our models such that they can find information in the data. Model architecture search will be the topic, but really what we're trying to do is figure out what the data is telling us. Then I want to get at how accuracy scales, and how you go about training these things on large-scale systems.
How best to do that? There are some rules of thumb and helpful tips there. So let me ask a few polls. I'm going to assume as table stakes that people in here are scientists; I think I saw a poll. You have datasets, you probably have some compute power, and you probably have existing models. Are those safe assumptions?
Okay, cool. So I want to know what we are aiming for. We're here learning about deep learning, but why is it that we're trying to learn about it? Is it because people are looking for novel research insights? Maybe raise your hand if you're looking for that. Okay, so lots of researchers here. How about validating or verifying existing models?
Maybe, okay. And how about product-oriented improvements? Is anybody working on product things? A couple, okay. Are there other things that people are planning or finding to use deep learning for? Thorsten might have the answers for some people; I've done a little bit of that, and Thorsten has definitely taken it to the extreme. I will cover a little bit of it. And then can people tell me what tools they currently use, so that I know what to gloss over if necessary?
So let me jump in. I'll describe this research study that we ran to set up the context for some of the tips and tricks that I'll describe later. When I started at Baidu, I was trying to get my hands dirty with some of these existing models. This was after the craze of computer vision, where there were nice computer vision models that you could take off the shelf, like ResNets, and train them on large datasets.
I was speaking with a bunch of the machine learning researchers there, and we had this common observation across a lot of different application domains: there's this view that if you increase your training dataset size, you get improved accuracy. Here's an example of one of those plots, from Banko and Brill's paper in 2001. This is on a task called word sense disambiguation: if you have a word, what is the meaning of that word in the context it's in?
It felt like some techniques that we had didn't really work, but the techniques that they tested here did follow these trends. Because of the history of this, it wasn't clear which we should go after: is dataset scaling the thing that really matters, or is it actually the models? The plot here suggested to them that there's a different model that is the best model at each data size; so, for instance, the winnow learner.
So, given these observations, we had a few hypotheses that we wanted to test. Maybe all applications have the same or similar trends. If they had the same trends, that would be kind of nice: then I could say, in this application domain I have this trend, and in this other application domain I can expect it to be the same.
Maybe all models also follow the same trends. I think the characteristics you find in deep learning actually suggest that as you get more sophisticated models, you can eliminate some of the impediments to learning that we've had with traditional machine learning techniques; some of that has probably been covered here already. And then for a machine learning researcher or engineer you have this sort of conflict, I guess: if more data gets me better accuracy and that's all I need to do, that kind of sucks.
B
Like
surely
surely,
there's
something
more
I
can
do
as
a
machine
learning
researcher
engineer.
So
we
went
in
and
kind
of
dug
around
to
look
at
what
the
what
what
is
theory
tell
us
about
how
accuracy
should
scale
with
data,
and
so
I
should
note
here
that
what
I'm
referring
to
is
generalization
error.
Everybody
turn
familiar
with
the
term
generalization
error.
It's
how
well
a
model
generalizes
to
unseen
data
elements.
Epsilon is the error, m is the number of training samples, and alpha, beta, and gamma are some constants. The basic idea is that error should scale as some constant times your dataset size to some power, plus some other constant, roughly epsilon(m) = alpha * m^beta + gamma, and the exponent beta in these papers is generally minus 1/2.
Alpha and gamma are constants that are related to the problem. Alpha captures things like the model characteristics, things like inductive biases, and gamma captures a sort of minimum error: if you have stochasticity in your data, there's a minimum error that I can reach. I will go through these in detail with some plots going forward, but anyway, theory is telling us that we should expect this minus 1/2 exponent in this power law.
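As a concrete illustration of that power-law form, here is a minimal sketch of fitting epsilon(m) = alpha * m^beta + gamma to measured learning-curve points with SciPy; the sample points and starting guesses below are made up for illustration, not taken from the study.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(m, alpha, beta, gamma):
    # Generalization-error model: eps(m) = alpha * m^beta + gamma
    return alpha * np.power(m, beta) + gamma

# Hypothetical (dataset size, validation error) measurements, not real data.
m = np.array([1e4, 2e4, 4e4, 8e4, 1.6e5, 3.2e5])
err = np.array([0.42, 0.36, 0.31, 0.27, 0.24, 0.22])

# Fit alpha, beta, gamma; beta should come out negative (error falls with more data).
(alpha, beta, gamma), _ = curve_fit(power_law, m, err, p0=[10.0, -0.5, 0.05], maxfev=10000)
print(f"alpha={alpha:.3f}, beta={beta:.3f}, gamma={gamma:.3f}")
```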
B
So
the
large-scale
study
that
we
were
going
to
run
was
two
tests.
How
does
error
scale
for
a
bunch
of
different
applications
using
a
bunch
of
different
models,
and
so
we
started
with
didn't
start
with
machine
translation,
but
this
is
the
one
where
we
finally
I
think
hid
the
methodology
correctly
in
nmt.
So
the
the
task
with
neural
machine
translation
is
it's
a
text
to
text
task
I'm,
taking
in
strings
in
one
language
and
I'm
generating
strings
in
another
language.
So
this
is
like
Google
Translate.
One model has 208 hidden dimensions, hidden weights in each layer, so it's a sort of narrow graph, a narrow model, and then we also increased the model size to 512 hidden dimensions. What we see is this predictable power law plus some constant. The example here shows that this small model has a high gamma, suggesting that there's something going on that's impeding its ability to train any further than this error. But what was interesting to find here is the beta.
B
The
thing
that
we
were
sort
of
relying
on
for
scaling
accuracy
with
data
is
not
equal
to
this
minus
1/2.
That
was
expected
in
the
theory.
Instead,
it's
it's
a
smaller
exponent,
smaller
in
magnitude,
and
so
what
does
that
mean?
It
basically
means
that
I'm
not
scaling
as
well
as
theoretical
and
as
I
increase
my
dataset
size,
any
questions
on
this
so
far.
So
there
are
a
lot
of
factors
in
this
whole
learning
curve
collection
that
are
sort
of
tricky
to
to
decouple
we've
done.
B
We
did
a
bunch
of
ablation
studies,
sort
of
after
the
paper
that
we
put
out
about
it
and
I
do
have
some
of
that.
Insight
in
here
hopefully
get
at
some
of
those
questions,
and
maybe
ask
that
again
later,
all
right.
So
in
practice,
though,
I'm
not
going
to
just
train
a
single
model
size
on
every
dataset
size
as
I
grow,
the
dataset
size
I
probably
want
to
grow
the
models
also,
and
so,
and
when
I'm
going
to
deploy
a
model
say
I
was
going
to
take
a
model
to
production,
a
production
scenario.
B
What
I'm
going
to
do
is
I'm
going
to
train
a
bunch
of
models
and
I'm
gonna
choose
the
ones
that
generalize
the
best
and
so
without
changing
sort
of
the
model.
Family
I'm
just
changing
hyper
parameters,
tuning
the
learning
rates,
things
like
that.
We
get
a
different
curve,
so
here
I'm
picking
the
best
the
best
model
at
each
data
set
size.
So
here's
a
small
data
set
I'll
still
natural,
sorry
machine
translation,
the
the
other
models
that
I
was
showing
were
head
scores
that
were
above
this.
This is the point that was the best model at this size. What we see now is that if we compose these curves together, we actually do see a power law, even without a constant; it fits very well for small dataset sizes. I'll describe this in a little more detail, but with m being the number of training samples, the exponent beta here is actually much worse than what we'd expect from theory. So this is somewhat disappointing.
The best I can do is guess the most likely output based on what the output distribution of my data looks like. So if the most common word in the English language is the word "the", then if I'm trying to predict the next word, "the" is a good choice. That's what I mean by best-guess error. As you continue increasing the dataset size, I can start lifting out nuance in the data, I can start finding relationships from the inputs for predicting the outputs, and I can fall into this power-law region.
B
The
power
law
region
is,
it
turns
out,
is
predictable
and
unfortunately,
I
have
to
collect
this
and
predict
it
empirically,
but
once
I
get
in
that
region,
I
can
sort
of
predict
what
my
accuracy
should
be
at
a
larger,
a
set
size
and
then,
if
you
have
so
this
is
for
all
real
applications.
All
real
applications
are
going
to
have
some
irreducible
error,
that's
nonzero!
B
What
I
mean
here.
Is
there
some
stochasticity
in
the
data
things
that
I
can't
predict,
no
matter
how
much
data,
no
matter,
how
many
input
features
I
have
and
so
there's
sort
of
a
lower
bound
here.
If,
if
we're
in
a
kind
of
vacuum,
you
could
expect
this
to
go
to
zero
error.
If
your
error
loss
function
is,
has
the
ability
to
go
to
zero,
this
power
law
region
is
what
we're
going
to
sort
of
leverage
and
I'll
talk
about
it
in
more
detail
in
the
rest
of
the
talk,
but
it's
it's
predictable.
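As a sketch of that extrapolation step, here is a small helper that inverts the fitted power law to estimate the dataset size needed to reach a target error; the constants shown are hypothetical fitted values, not numbers from the talk.

```python
def required_data_size(target_err, alpha, beta, gamma):
    # Invert eps = alpha * m^beta + gamma for m.
    # Only valid when target_err > gamma (you can never beat the irreducible error)
    # and beta < 0 (error falls as data grows).
    if target_err <= gamma:
        raise ValueError("Target error is below the irreducible error; unreachable.")
    return ((target_err - gamma) / alpha) ** (1.0 / beta)

# Example with hypothetical fitted constants.
print(required_data_size(0.18, alpha=10.0, beta=-0.3, gamma=0.05))  # ~2e6 samples
```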
B
We
can
collect
this
empirically.
Yes,
so
the
the
list
of
papers
that
I
put
up
there
has
details
on
each
of
these
and
the
proofs
end
up
being
very
complicated.
There.
Isn't
there
isn't
a
good
way
to
sort
of
summarize
this
and
describe
it
in
as
simple
a
there
are
and
then
the
one
of
the
problems
is.
There
are
different
assumptions
that
you
put
into
calculating.
So
if
you
look
at
nonparametric,
estimators
nonparametric
estimators
can
have
different
exponents
in
the
denominator.
B
Anything
that
you're
going
to
be
targeting
probably
with
deep
learning,
will
have
this
minus
1/2
as
the
as
the
best
case
that
you
can
get
yes,
so
this
is.
This
is
tricky.
It's
tricky
to
decompose
this.
The
major
reasons
that
we've
seen
are
things
like
visibility
of
all
of
the
information
in
your
inputs.
So,
for
example,
if
you
were
to
take
a
task
where
I
have
full
visibility
of
the
inputs,
the
the
irreducible
error
should
be
0.
B
There
I'm
gonna
try
to
integrate
across
the
other
inputs
to
make
a
prediction
for
the
dimensions
that
I
can't
see,
and
so
then
that's
sort
of
an
averaging
function
where
I'm
now
averaging
between
a
model
that
would
have
perfect
visibility
of
all
the
inputs
and
the
error
is
going
to
decline
or
it's
gonna.
It's
gonna,
get
worse
sorry,
increase.
B
Okay,
so
the
methodology
that
we
we
followed
for
this
is
nice.
In
a
vacuum
we
shard
the
data
set
in
to
power
to
sizes
we
for
each
data
set
size,
we
train
a
bunch
of
models
looking
for
kind
of
the
best
model,
a
best
fit
model
plot.
The
results
write
the
paper.
This
is
rife
with
the
spherical
cow
in
a
vacuum
and
so
hopefully
like
the
stuff
that
will
come
after
this
will
show
you
a
lot
of
the
practicalities
we
ran
into
when
we
were
trying
to
do
this
at
scale.
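A rough sketch of that "spherical cow" methodology loop, with placeholder train and evaluate functions standing in for whatever framework and model family you actually use; the shard count and sampling are illustrative assumptions.

```python
import random

def learning_curve_study(dataset, hyperparam_grid, train_fn, eval_fn, num_shards=6):
    """Shard a dataset (a list of samples) into power-of-two sizes and record the
    best validation error found at each size, for fitting a learning curve later."""
    results = []
    for k in range(num_shards):
        size = len(dataset) // (2 ** (num_shards - 1 - k))  # grow sizes by powers of two
        shard = random.sample(dataset, size)
        # Hyperparameter search at this dataset size; keep the best-fit model's error.
        best = min(eval_fn(train_fn(shard, hp)) for hp in hyperparam_grid)
        results.append((size, best))
    return results  # (dataset size, best validation error) pairs
```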
B
When
you're
setting
up
your
tools,
there
are
some
things
in
training
you've
seen
a
lot
of
this
probably
experienced
a
lot
of
this
already
so
I'm
gonna
try
to
hit
things
that
are
maybe
unique
from
what
other
people
have
said,
and
also
things
that
have
been
very
valuable
in
our
testing.
I
can
skip
over
code
because
a
lot
of
you
have
seen
the
kind
of
code
that
exists
in
frameworks.
B
Obviously,
there
are
many
useful
tools
available
online,
many
pre-trained
models
and
things.
My
maybe
short
story
on
this
is
when
we
were
doing
this
study.
I
had
a
young
engineer
who
was
he
was
familiar
with
cafe
from
some
work
during
school.
He
was
trying
to
set
up
cafe,
to
run
some
tests
for
this
large
scale
study
and
had
been
finding
finding
it
difficult
to
set
up
the
data
pipeline
and
other
things
a
couple
days
into
it.
B
We
sat
down
together
and
he
said
you
know
I'm
struggling
with
these
things
and
I
said
well,
maybe
maybe
we
could
find
another
tool
that
would
work
online
and
in
about
45
minutes
we
had
a
tensor
flow
model
that
was
training
and
running
at
the
scale
that
we
were
hoping
for
so
be
sort
of.
Mindful
of
this,
you
can
go
find
nice
tools.
B
Do
things
from
off-the-shelf
models
as
you're
setting
up
to
train
the
storage
that
you'll
run
into
storage
can
be
a
challenge
also,
datasets
that
we
were
training
on
were
actually
small
relative
to
a
lot
of
things
that
people
in
here
might
be
dealing
with
some
of
the
language
tasks.
That
I
was
looking
at
where
only
gigabytes
of
data
set
speech.
Recognition
were
up
two
terabytes,
but
people
in
here
maybe
have
have
tens.
Hundreds
of
terabytes
of
data
and
so
storage
is.
These
can
be
massive
trying
to
get
good
input
pipelines
and
things
is
tricky.
B
If,
if
possible,
just
use
off-the-shelf
data
sets
for
setting
up
to
set
baselines
and
match
what
previous
people
have
done,
and
then
this
is
also
probably
goes
without
saying,
make
sure
you're
taking
checkpoints,
while
you're
training,
if
you're,
if
you're
a
machine,
happens
to
die,
you
don't
want
to
have
to
rerun
everything
from
scratch
and
then
one
thing
that
I've
found
is
is
useful
for
people
who
have
maybe
seen
a
little
bit
of
they
haven't
had
a
whole
lot
of
experience
with
frameworks.
Yet
is
theirs.
B
They're
sort
of
two
different
ways
that
frameworks
are
structured,
one
is,
is
called
eager
evaluation
frameworks.
The
other
is
lazy,
evaluation
framework,
so
I'm,
given
that
a
lot
of
people
in
here
have
used
things
like
MATLAB
and
are
you
know
that
and
every
time
you
execute
a
line
of
code,
the
data
is
ready
for
you.
You
can
inspect
the
data,
that's
really
nice
for
what
mustafa
was
describing.
Trickier frameworks use a thing called lazy evaluation. This is where you construct the graph first and the framework builds an internal representation. So instead of something like this, where the gray boxes actually hold my data, what I get is a graph like this as an internal representation: I have references to the weights, I have references to the inputs, and I need to tell the framework to propagate, to put this data into these locations, these handles, and give me back results.
One benefit of doing lazy evaluation is that you can do some optimization before running the session, so sometimes this can be faster.
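A small illustration of the two styles: eager with NumPy-like immediate values versus lazy with a TensorFlow 1.x-style graph and session. This is a sketch of the contrast, not a full training setup.

```python
# Eager style (NumPy, PyTorch, TF eager): each line produces a concrete value you can inspect.
import numpy as np
x = np.random.randn(4, 3)
w = np.random.randn(3, 2)
y = x @ w  # y holds real data right now; print(y) works immediately.

# Lazy style (TensorFlow 1.x graph mode): build symbolic handles first, then run the graph.
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()
x_ph = tf.placeholder(tf.float32, shape=[None, 3])   # handle for future inputs
w_var = tf.Variable(tf.random_normal([3, 2]))        # handle for the weights
y_op = tf.matmul(x_ph, w_var)                        # a node in the graph, no data yet
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    result = sess.run(y_op, feed_dict={x_ph: np.random.randn(4, 3).astype(np.float32)})
```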
Okay. Debugging is something that is always a challenge in these things; you want to be able to probe tensors, you want to be able to print values. I'm not going to go into much detail, but there are some great references online for ways to do this.
One of the challenges that we had when we were setting up these large-scale experiments, because we were running a bunch of experiments in parallel doing a lot of hyperparameter search, was that we needed to be able to go back and identify which runs were giving me which data points in my set. It's maybe obvious, but keeping your configuration files with your checkpoints is a really handy thing to do. It's very painful when you find yourself in the situation where I have a checkpoint, I know that it was the good model, I need to reproduce some results or run on a different validation set, but I don't have the parameters that I used for training, so I can't do fine-tuning or something. So keep your configurations around.
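One simple way to do that is to write the run configuration next to every checkpoint. Here is a minimal sketch; the save callable, directory layout, and config fields are hypothetical placeholders.

```python
import json, os, time

def save_checkpoint_with_config(save_model_fn, config, ckpt_dir, step):
    """Save a checkpoint and the exact config used to produce it, side by side."""
    os.makedirs(ckpt_dir, exist_ok=True)
    ckpt_path = os.path.join(ckpt_dir, f"model_step{step}.ckpt")
    save_model_fn(ckpt_path)  # framework-specific save call (e.g., saver.save / torch.save)
    with open(os.path.join(ckpt_dir, f"model_step{step}.config.json"), "w") as f:
        json.dump({"config": config, "step": step, "saved_at": time.time()}, f, indent=2)
    return ckpt_path

# Hypothetical usage:
# save_checkpoint_with_config(lambda p: torch.save(model.state_dict(), p),
#                             {"lr": 1e-3, "hidden_dim": 512}, "runs/nmt_512", step=10000)
```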
Mustafa was talking about looking at our training curves and comparing them to validation curves, and about wanting to stop our training early, before the model starts overfitting.
This is where we start to see the validation curve diverge from the training curve. One practicality that you'll run into when you're running a large hyperparameter search is that different model sizes are going to train at different rates, the training curves are going to look different, and setting how frequently you run validation is a tricky thing. The practical thing you're working with here is that your validation set has to be some size; it has to be large enough to give you a statistically significant estimate of the divergence between the training and validation curves.
B
So
I
want
it
to
be
large
enough,
but
I
don't
want
it
to
be
so
large
that
I'm
actually
spending
more
time
doing,
validation
than
actually
training
and
so
picking
the
validation
periods.
How
frequently
you
run
validation
is
a
little
bit
tricky
I,
don't
want
it
to
be
too
frequent
to
cause
unnecessary
computation,
but
I
want
to
sort
of
pick
a
frequency
where
I'm
sure
that
I
won't
miss
the
best
model
in
there
somewhere.
I
want
to
be
very
close
to
that
best
model
that
I've
found.
B
This,
unfortunately,
is
something
you
have
to
kind
of
do
empirically
and
test
a
few
models.
Once
you
get
the
hang
of
it
with
some
models,
you
know
it
will
extend,
so
you
can
use
this
practically
speaking
it
later.
If
you
increase
data,
set
sizes
and
increase
models,
you
can
kind
of
estimate
what
this
should
be.
Okay,.
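Here is a minimal sketch of that pattern: run validation every so many steps, keep the best checkpoint, and stop once the validation loss has not improved for a while. The period and patience values are placeholders you would tune empirically, as he describes.

```python
def train_with_early_stopping(train_step, validate, save_ckpt,
                              max_steps=100_000, val_period=1_000, patience=10):
    """Validate every `val_period` steps; stop after `patience` validations with no improvement."""
    best_val, bad_evals = float("inf"), 0
    for step in range(1, max_steps + 1):
        train_step()
        if step % val_period == 0:
            val_loss = validate()
            if val_loss < best_val:
                best_val, bad_evals = val_loss, 0
                save_ckpt(step)            # keep the best model seen so far
            else:
                bad_evals += 1
                if bad_evals >= patience:  # validation has diverged from training
                    break
    return best_val
```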
B
After
training,
there's
there's
a
challenge,
I'd
say:
I've
selected
a
model
on
my
validation
set
and
now
I
want
to
test
it
on
some
sort
of
real-world
data.
My
test
set
and
there's
there's
a
problem
here
called
test
train
distribution
mismatch.
This
is
where
the
the
test
set
has
a
different
data
distribution
from
the
training
set.
Everybody
familiar
with
these
challenges
already
some,
maybe
some,
not
okay.
B
So
when
you're
going
into
these
problems,
you
want
to
have
a
strategy
for
how
you
would
address
something
like
test,
train
mismatch
and,
as
an
example,
I
heard
this
from
some
Amazon
Alexa
engineers
that
they
had
set
up
this
nice
training
set.
That
was
that
was
using
data
from
like
audio
books
and
recorded
audio
in
a
sound
booth,
and
so
there
was
it
was
very
clean
audio
and
it
was
people
reading
from
something
it
wasn't
like
a
speaker
up
front
telling
you
things.
B
So
not
only
was
it
very
clean
audio,
but
it
was
kind
of
a
weird
contrived
sample
where,
if
I'm
talking
to
alexa
the
vocabulary
and
things
I'm
going
to
use
are
different
than
if
I'm
just
reading
from
from
a
book,
and
so
when
they
sort
of
deployed
this
in
a
in
a
practical
setting
with
a
device
that
had
a
microphone
in
it.
The
microphone
was
catching
ambient.
B
Noise
echoes
all
kinds
of
weird
things
and
the
phrases
that
people
are
using
to
speak
to
this
thing
we're
wrong,
and
so
there
was
a
really
bad
mismatch
between
their
to
their
training
distribution
and
the
actual
scenario
deployment.
I
so
Amazon
picked
a
strategy
here.
It
was
kind
of
unique.
They
went
off
and
tried
to
remove
all
the
ambient
noise
and
echoes
by
designing
this
really
incredible
microphone
array
into
the
device,
and
so
that's
something
you
can
consider
doing
like
punt.
B
Sometimes
tricky
to
pick
out
evaluation
metrics
that
make
the
most
sense
for
the
problems
that
we're
working
with.
So
as
an
example
the
when
we
were
working
on
our
speech
generation
systems
that
by
do
these,
these
were
sort
of
a
new
set
of
systems
where
we're
generating
trying
to
generate
audio
waveform
from
a
text
string,
and
we
want
that
audio
waveform
to
sound
natural.
Sorry, the problem with the L2 norm is that because it's a distance measure over two images, like the difference of two spectrograms, it permitted small perturbations or variance that caused things like static noise or robotic voices; the edges of things were hard in the spectrograms. To fix this, we ended up normalizing a few different loss functions, summing them together, and weighting them, and that helped with some of the issues.
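A sketch of that kind of composite objective: combine a few loss terms over the predicted and target spectrograms with weights. The specific terms and weights below are hypothetical, not the ones used at Baidu.

```python
import torch
import torch.nn.functional as F

def composite_loss(pred_spec, target_spec, w_l1=1.0, w_l2=0.5):
    """Weighted sum of two loss terms over predicted vs. target spectrograms."""
    loss_l1 = F.l1_loss(pred_spec, target_spec)   # tolerant of small per-bin offsets
    loss_l2 = F.mse_loss(pred_spec, target_spec)  # penalizes large deviations
    # The weights act as the normalization; tune them so neither term dominates.
    return w_l1 * loss_l1 + w_l2 * loss_l2
```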
Then eventually, what we decided to do, instead of just measuring error as something I can calculate in a framework, was to generate some samples and kick those samples off to something like Mechanical Turk. We used Mechanical Turk, posted our generated audio waveforms, and then asked users: what's your preference among these different versions, which one is the best? So there are evaluation metrics, like mean opinion score, which you can't actually calculate inside of your framework during a training run, and these crop up in a lot of practical applications.
B
Questions
on
these
okay,
all
right.
So
let
me
get
at
kind
of
the
meat
here
now
that
we're,
through
some
of
the
practical
training
set
up
I,
wanted
to
call
this
section
model,
architecture
search,
but
really
what
I
think
I
want
to
the
point
I
want
to
get
across
is
that
what
we're
doing
is
we're
using
models
to
give
us
information
about
data
and
data
is
really
the
thing
that
we're
trying
to
analyze.
We could go and build deep learning models that can model arbitrary structure. For example, there is a thing called the neural Turing machine, and technically it's Turing complete; it should be able to compute any function, any continuous, or sorry, any real-valued function, something that doesn't have decidability issues.
B
Unfortunately,
the
challenge
that
we
have
when
we're
doing
when
we're
doing
model
architecture
search,
is
it's
hard
to
optimize
some
of
these
models,
so
the
neural
Turing
machine
was
a
nice
theoretical
nicety,
something
that
was
interesting
to
try
play
around
with
certain
tasks.
It
was
very
hard
to
optimize,
so
it
was
it's
we've
drawn
ideas,
inspiration
from
it
since
then,
but
it's
not
a
model
that
that
we
generally
use
in
practice.
B
All
of
what
I'm
going
to
be
going
through
here
then,
is
focusing
on
how
we,
how
do
we
limit
the
the
challenges
to
training
these
models
and
the
key
insight
is
something
about.
What
does
the
error
surface?
Look
like
how
do
I
make
my
training
process
look
like
a
convex
process
at
least
locally,
and
so
I'm,
going
to
flip
the
perspective
start
from
the
data
and
look
at
what
the
data
is
telling
us
design
models
to
match,
and
so
here
the
here's,
an
overview
kind
of
the
pieces
that
I'm
going
to
describe
here.
The models that you're training are going to need sufficient capacity to capture the information content that's in the dataset, and I'll talk about that. Those learning curves also gave us some best-fit model sizes, so now we can analyze what the size of the best-fit model is and how capacity grows as we increase dataset size. And what you're going to be doing, if you're modifying models to improve their accuracy on your dataset, is engineering biases into them.
B
Ok,
so
data
sets
are
usually
usually
contain,
hierarchical
relationships,
so
these
are
bottom-up
structures.
You're
gonna
be
finding
small
bits
in
the
early
stages
of
the
models.
We've
seen
a
little
bit
of
this
already
and
later
deeper
in
the
model,
we're
going
to
be
capturing
sort
of
higher-order
concepts,
and
so
a
good
mental
model
for
what
your
data
set
looks
like
is
something
like
an
ontology
and
unfortunately,
what
deep
learning
results
in
what?
What
deep
learning
models
learn
is
not
always
the
same.
Ontology
is
our
intuition.
B
The
big
takeaway
here
is,
if
you
have
sort
of
poor
representations
early
in
the
model,
capturing
the
little
bits
it's
hard
to
to
use
that
information
later
in
the
model,
and
so
the
model
have
sort
of
instability
or
error
that
gets
introduced
early
on
so
Mustafa
showed
an
example
here.
I'll
spend
a
little
bit
more
time
on
this.
This is from a very interesting talk that I'd highly recommend watching, because it breaks down the hierarchy of data in the context of computer vision models, Zeiler and Fergus, a great talk. They designed a thing called a deconvolutional network that allows them to back out what filters later in the model are perceiving from an image. So here are some examples: this is what the network is perceiving after layer 1.
These are patches of the images that are matched by these different filters. The top corner here looks like an edge detector for something that's north-south oriented, and so you see a bunch of patches from an image that are north-south oriented; we're capturing information about this edge here. Some of these are color also, so color is captured in these patches, like the green here. As we move farther into the network, we see that the network is starting to compose these things.
B
We
also
see
things
like
barcodes
in
the
middle
of
the
network,
so
these
are
things
that
have
still
have
fairly
regular
structure,
but
at
a
higher
granularity
than
earlier
in
the
network,
as
we
get
very
deep
in
the
model.
So
this
was
a
I
think
used
something
like
vgg
Nets
at
towards
the
end
of
the
model,
we're
capturing
things
that
are
more
organic
in
nature,
their
compositions
of
a
lot
of
different
bits
and
pieces.
So,
for
instance,
capturing
eyes
of
owls.
B
You've
got
a
little
bit
of
symmetry,
but
it's
not
as
regular
as
things
like
lattices.
So
it
takes
a
little
bit
more
hierarchy
to
extract
those
things
out.
This
makes
sense.
Okay,
there's
another
example.
Here,
that's
that's
useful
or
helpful.
One!
Isn't
it
to
kind
of
see
how
this
works
and
that's
in
language
modeling?
So
this
is
sort
of
my
main
area
of
research.
The first layer of language models is an embedding layer, and here what you're trying to do is map from a vocabulary of some sort into an internal representation that the model can use, so it's going to be looking up vectors. Here's a low-dimensional projection of a few of these vectors; let's say the corner here is zero-zero, so "man" is positioned here, "woman" is positioned here, etc. Given this low-dimensional representation, I can find the eigenvalues or something like that.
B
The
embedding
layers
of
these
models
captures
sort
of
low
dimensional
relationships
like
this
like
gender,
so
other
things
that
you
might
consider
would
be
you
know,
taking
a
verb
and
making
it
an
adverb
that
might
be
a
projection
in
this
low
dimensional
space.
If
we
look
at
models
like
elmo,
Burt
and
GPT
to
some
of
the
more
recent
transformer
models,
more
complicated
models,
they're
going
to
use
attention
mechanisms
to
now
combine
these
embedding
representations
and
they'll.
Do
it
somewhat
hierarchically,
and
so
this
is
a
simplified
example
of
what
these
things
are
doing.
B
But
you
could
imagine
if
we
had
this
sequence
that
looks
like
this.
The
King
asks
the
blank
if
she
would
attend
a
party
or
something
given
the
representation
of
King
here
I'm
going
to
attend
heavily.
So
the
darker
arrow
here
means
a
model
is
probably
attending
heavily
on
the
word
King,
probably
attending
heavily
on
she
its
it
needs
to
it's
recognizing
that
the
position
of
this
object
probably
has
a
gender
female.
The picture that I grabbed here was just from online, so that's the only reason; I could have gone the other way, I could have. One other interesting thing is, if you look at the position of the word "human" here, or "person", I think one of these words, it will land sort of in between these, and there will be a vector to the female gender and a vector to the male gender. These are the sort of analogy tasks that embeddings can take on. Yeah.
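That analogy structure is easy to poke at in code. Here is a minimal sketch with tiny hand-made vectors, just to show the vector-offset idea (king - man + woman lands near queen); real embeddings would come from a trained model, and these toy numbers are made up.

```python
import numpy as np

# Toy 3-d embeddings, invented for illustration; real ones are learned.
emb = {
    "man":   np.array([1.0, 0.0, 0.2]),
    "woman": np.array([1.0, 1.0, 0.2]),
    "king":  np.array([2.0, 0.0, 0.9]),
    "queen": np.array([2.0, 1.0, 0.9]),
}

def nearest(vec, vocab):
    # Cosine similarity against every word in the toy vocabulary.
    sims = {w: vec @ v / (np.linalg.norm(vec) * np.linalg.norm(v)) for w, v in vocab.items()}
    return max(sims, key=sims.get)

analogy = emb["king"] - emb["man"] + emb["woman"]
print(nearest(analogy, emb))  # "queen" with these toy vectors
```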
Yes, right, so that's exactly how attention mechanisms are designed. You basically take a query; the query in this case is this blank position here, this question mark. I have embeddings for each of these, except this one, where I'm embedding kind of an empty character, and then on the output of this, what I'm doing is some relationship analysis, clustering, between these vectors. So I would take the vector for "king" and the article here, this vector.
B
These
are
probably
related,
and
the
attention
mechanism
would
look
at
the
those
two
vectors
they
do.
Like
a
you
know,
a
dot
product
between
them.
The
dot
product
is
gonna,
say
how
closely
related
these
things
are,
how
important
is
Queen
in
this
representation
and
then
from
the
dot
product.
I'll
do
a
softmax,
so
this
will
be
turn
it
into
a
probability
distribution.
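That dot-product-then-softmax recipe is the core of attention. Here is a minimal NumPy sketch of single-head scaled dot-product attention over a short sequence of embeddings; the dimensions and random vectors are placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dot_product_attention(query, keys, values):
    """query: (d,), keys/values: (seq_len, d). Returns a weighted mix of the values."""
    scores = keys @ query / np.sqrt(query.shape[-1])  # how related each position is to the query
    weights = softmax(scores)                         # turn scores into a probability distribution
    return weights @ values, weights

# Toy example: the blank position queries the other token embeddings.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))      # embeddings for "The king asks the ___ ..."
blank_query = rng.normal(size=(8,))
context, attn = dot_product_attention(blank_query, tokens, tokens)
print(attn)  # attention weights over the five tokens, summing to 1
```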
That's maybe something I should have expanded on here, because we saw, I think, a lot of content about computer vision models. Oh, I will say a little bit more about new models. Some of the new models that are being used very widely right now are called transformers, and they're basically just sets of these attention mechanisms, without other things, without convolution.
B
Generally,
the
it's
worth
saying:
generally
attention
mechanisms
work
well
in
things
like
sequences
and
things
like
text,
you
can
use
them
in
computer
vision,
applications
or
certain
things
in
computer
vision
that
are
important
like
if
I'm
trying
to
steer
a
camera.
For
instance,
I'd
like
to
figure
out
where
I
should
be
attending
and
then
start
to
try
to
set
my
reticle
of
the
camera
on
that
position.
So
you
can
use
them
in
computer
vision,
but
the
applications
aren't,
as
aren't
as
clear
obvious.
The idea is that as I propagate through a network, I can mess up the information content that I have at any stage by narrowing my model. If you consider a model that was just a bunch of fully connected layers, and I reduce the dimension of one of these fully connected layers, I'm squashing my information down into a low-dimensional representation, and because of that I can lose information.
B
Generally,
we
don't
design
models
like
that.
We
keep
all
the
layers
kind
of
the
same
width
so
that
we
don't
squish
squish
out
information,
but
the
bottleneck
is,
is
a
really
important
concept.
So,
if
I
squished
the
inner
dimension
of
the
model
at
some
points,
the
the
I
am
bottlenecking
the
amount
of
information
that
can
traverse
through
it.
So
in
recent
years,
we've
been
looking
at
the
information
bottleneck
in
the
context
of
deep
learning,
and
we
get
some
very
interesting
plots
that
look
like
this.
They also echo this argument that data is hierarchical. This plot is showing, on the horizontal axis, the mutual information between my input X to the network and the activations at layer T of the model, and each one of these lines is the training process for one of the layers: this is the first layer, the second layer, the third layer, going in depth through the model. So as I'm training, the information is changing.
Why is it small? It's small because what it's doing is detecting small bits of the information in the input, and then later on the further layers are going to combine that information with more information that they're extracting from the input, and so the information content relative to the input is increasing; it moves to the right as you get deeper in the model.
B
The
vertical
axis
here:
I
have
the
mutual
information
between
the
current
layer,
activations
and
the
output
of
the
model,
and
at
the
end,
hopefully,
these
activations
are
capable
of
telling
me
what
the
function
should
be.
What
function
I'm
calculating
it
should
be
able
to
predict
the
outputs
of
the
model,
so
hopefully
the
mutual
information
should
the
outputs
is
perfect
in
this
case
one.
So if I need to reduce the dimension of my input down to make a classification, say I'm just saying whether something is a cat or a dog in an ImageNet image, at some point I have to reduce the dimension of my model down to a prediction, cat or dog. I can do that right at the end, but that might not be the best approach; it might actually be smarter to enforce that I'm slowly eliminating information by narrowing the model as it gets deeper. Does that make sense?
Okay, so, finally, to wrap back to the tests that we were running for this research study, what we found is sort of well-known, but the systematic study we did gives us some ways to decompose this, to break it apart. Finite datasets have a finite amount of information, which is actually convenient, because that means I should be able to represent them with a finite model.
B
Models
with
sufficient
capacity,
which
I'll
define
in
a
little
bit
we'll
fit
a
full
data
set,
and
so
what
I'm
plotting
here
is
examples
of
what
is
sufficient
capacity
or
what
is
sufficient
number
of
parameters
in
a
model
to
fit
a
data
set.
So
here
what
I
have
is
this
is
for
word,
language
modeling.
We
had
particular
data
sets
and
we
were
looking
at
training
sets
that
were
chunks
of
this,
so
this
is
sort
of
powers
of
two
roughly
different
sizes,
and
this
is
the
validation
loss.
B
So
this
is
how
well
these
models
generalizes
I
trained
them.
This
small
data
set
has
sort
of
a
finite
amount
of
information
that
a
small
models
of
model.
That's
this
is
model
size
number
Prem
here
model
size
of
what
maybe
200,000
parameters
is
sufficient
to
fit
the
training
set
here
and
anything
that
any
larger
model
here
would
have
capacity
to
over
fit
so
I
don't
actually
need
a
larger
model.
This gets at something that Mustafa pointed out early on: it's possible to grow your dataset size by just taking another copy of your dataset and concatenating it to the one you currently have. There's no new information there, because I already have all of that information in previous samples, so you have to be careful about what it means to grow a dataset.
B
What
we
really
want
to
do
is
make
sure
that
we're
growing
the
information
content
in
our
data
set,
and
so
this
this
is
sort
of
captured
here
or
subdividing
our
data
set
into
different
chunks
sizes
and
then
looking
at
the
model
size
that
fits
and
I'll
get
to
in
a
little
bit.
How
should
model
size
grow
as
we
increase
data
set
size,
larger
data
sets
require
larger
models
and
there
is
sort
of
a
theoretical
background,
for
that
would
help
us
predict
how
large
models
should
be
for
a
given
data
set
size.
B
If
we
have
some
point
of
reference
and
a
smaller
data
set
and
that's
model
capacity,
measures
like
the
VC
dimension,
so
VC
dimension
is
an
upper
bound
measure
of
capacity.
The
rigorous
definition
is
a
little
clunky.
It's
it's
given
the
model,
it's
the
largest
set
of
points
that
can
have
arbitrary
labeling
that
you
can
shatter,
meaning
you
can
split
and
distinguish
all
of
the
separate
points.
B
The
definition
here
is
not
so
important
as
kind
of
what
this
means
conceptually
when
we're
changing
models.
If
you
go,
look
at
some
prior
techniques,
try
like
sort
of
traditional
machine
learning
techniques
things
so
traditional
machine
learning
techniques
have
some
limitations
on
how
capacity
grows.
It's
it's
hard
to
grow
the
capacity
of
some
of
these
techniques,
and
so,
for
example,
decision
trees.
Have
capacity
is
something
like
order.
B
N
log
D,
where
n
is
the
number
of
leaves
in
the
tree
and
D
is
the
number
of
dimensions
of
each
data
sample,
and
so
what
does
this
mean?
It
means
that
sort
of
the
size
of
the
tree,
the
size
of
the
tree,
is
actually
probably
n,
plus
a
log
of
n.
The
size
of
the
tree
grows
roughly
linearly
with
the
capacity
of
the
model,
and
so
this
isn't
a
very
this.
B
Isn't
that
we'd
like
to
have
some
compression
factor
on
our
data
we'd
like
we'd
like
to
be
able
to
represent
it
with
less
with
a
structure?
That's
smaller,
it
grows
more
slowly
than
the
data
and
that's
where
deep
learning
is
really
deep.
Learning
is
really
nice
here,
so
Bartlett
in
2017
showed
that
deep
neural
networks
with
nonlinearities
have
capacity
WL,
log,
W
and
here
W
is
the
number
of
weights
in
your
model,
and
so
this
is
actually
proportional
to
the
size.
B
In
this
case,
and
L
is
roughly
the
depth
of
the
model,
it's
an
interesting
way
of
characterizing
depth
of
the
model,
and
so
the
the
what's
really
interesting
about
this
is
now
I.
Have
this
sort
of
trade
off
between
capacity
of
the
model
and
depth
but
I
also
have
this
really
nice
factor?
I
have
W
log
W
here,
meaning
the
the
capacity
of
deep
neural
networks?
Actually
grows
super
linearly
with
the
number
of
parameters
the
amount
of
storage
I
require
to
sorry
the
they
grow
super
linearly
in
the
storage.
This is not exactly what you're seeing if you're looking at computer vision models, which have a lot of convolution; you're doing some interesting ablation of the fully-connectedness of this kind of model when you're doing computer vision things. Okay, so now if I have this nice trend, if I can maybe rely on this, and these are tight bounds on this type of network, if I can rely on this trend that capacity grows super-linearly in the number of weights, I can start to predict, perhaps, how the model size should grow with my dataset size.
B
So,
let's
start
with
the
sort
of
simplifying
assumption
suppose
the
information
content
of
my
data
set
grows
linearly
in
its
size,
which
is
sort
of
a
reasonable
assumption.
I'm
going
to
pick
out
new
samples,
I'm
not
going
to
try
to
repeat
old
samples,
and
so
hopefully
I'm
giving
them
the
data
set
more
information
as
I
grow
it
and
so
model.
Given
that
capacity
grows
super
linearly
in
weights
model,
our
model
sizes
should
need
to
grow
sub
linearly.
B
In
the
data
set
size,
which
is
a
nice
property,
it
should
hopefully
be
less
than
linear,
and
if
you,
if
you
do
some
back
of
the
envelope
calculations
and
some
approximations
here,
it
should
probably
grow
greater
than
square
root
in
size
and
in
fact,
what
we
found
when
we
were
doing
our
studies.
This
was
true
across
all
of
the
applications
that
we
tested
five
different
domains.
This is a power-law trend fit of the curve showing that ResNets scale approximately with a 0.57 exponent. So if I am going to double my dataset size, I should take about a square-root-of-two increase in my model size, which is a really nice property when you want to start scaling.
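A quick back-of-the-envelope helper for that rule: project the required model size from a measured (data size, model size) reference point and a sub-linear exponent. The 0.57 value is the ResNet exponent he quotes; the reference numbers below are placeholders, and erring toward an exponent of 1.0 is the safer choice when you don't know it.

```python
def projected_model_size(ref_data, ref_params, new_data, exponent=0.57):
    """Sub-linear growth rule: params ~ data^exponent, anchored at a known point."""
    return ref_params * (new_data / ref_data) ** exponent

# Doubling the data with the ResNet-like exponent gives roughly a sqrt(2)x larger model.
print(projected_model_size(ref_data=1e6, ref_params=25e6, new_data=2e6))            # ~37M params
print(projected_model_size(ref_data=1e6, ref_params=25e6, new_data=2e6, exponent=1.0))  # 50M, conservative linear growth
```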
B
B
B
B
If
you
want
a
scale-
and
you
don't
know
exactly
what
exponent
to
use
here,
you
can
err
on
the
side
of
more
linear
growth,
so
we've
seen
the
bias-variance
tradeoff
I'm
gonna
reflect
on
it
again
because
it
gets
into
kind
of
some
of
the
things
that
deep
learning
does
like
I
just
said,
deep
learning
is
a
little
bit
fungible
on
on
parameters.
How
you
grow
models
so
given
mean
square
error,
is
the
bias
squared
plus
the
variance
plus
some
noise,
historically
and
sort
of
traditional
machine
learning
domains.
B
People
said
we
need
to
optimize
the
bias,
because
that's
the
squared
term,
that's
the
thing
I
want
to
minimize
in
practice
with
deep
learning.
There
isn't
really
always
a
trade-off
here.
There's
it
isn't.
The
bias-variance
tradeoff
doesn't
doesn't
really
come
into
play.
It's
so
as
an
example,
we
can
over
parameterize
our
model
and
a
lot
of
models
are
good
kind
of
ignoring
weights.
That
aren't
important
and
they
they're
still
there
but
they're
not
contributing
much
to
the
calculation,
and
we
can
do
things
like
regularization.
B
Regularization
techniques
can
force
those
weights
to
not
do
anything,
and
so
there
is
kind
of
a
heated
debate
in
the
deep
learning
community,
beginning
in
2017
until
early
2018
about
why
do
deep,
neural
networks
actually
generalize
when
they're
so
over
parameterised-
and
this
is
something
that
is
if
you
want
to
go-
read
some
heavy
research
papers.
This
is
a
place
where
you
can
get
some
very
deep
insights,
very
deep
understanding
of
generalization.
B
Ok,
so
what
we
want
to
do
is
we
want
to
engineer
biases
into
our
model
and
so
I'm
going
to
talk
through
a
few
of
these
different
biases.
So
inductive
bias
is
sort
of
conceptually
speaking.
I
want
my
model
to
be
able
to
deduce
something
I
want
it.
I
want
to
tell
it
how
to
make
deductions,
and
so
we
call
that
inductive
bias
and
we're
gonna
bake
in
these
assumptions
into
our
models.
This
is
a
sort
of
general
process,
so
there
are
some
examples.
You've
already
seen
and
I'm
gonna
try
to
extend
those
examples.
B
So
you
have
a
bunch
of
different
keys
and
sort
of
triangulation
points
to
go
back
to.
We
talked
about
how
computer
vision
applications
probably
should
be
translationally
invariant,
and
so
we
we
build
our
models
out
of
convolutions,
where
I'm
doing
applying
the
filter
at
every
pixel
in
the
in
the
image,
so
I
should
be
able
to
detect
an
edge
anywhere.
B
The
second
example
is
languages,
structured
sequentially.
So
what
does
that
mean?
The
words
that
are
coming
out
of
my
mouth
right
are
dependent
on
things
that
I've
previously
been
saying,
and
so
I
want
to
condition
what
I'm
saying
now.
On
the
previous
words.
Basically,
all
language
is
structured
like
this,
their
previous
conditions,
and
so
we
want
to
use
models
that
sort
of
integrate
this
knowledge.
We
want
to
use
sort
of
Bayes,
rule
and
kind
of
decompose.
B
These
conditional
probabilities
through
time
in
a
sequence
and
so
I
think
there
will
be
a
talk,
maybe
later
about
sequential
models,
recurrent
models
and
so
you'll
see
the
the
details
of
those
models
later.
But
we
put
this
inductive
bias
into
those
models
and
then
the
more
recent
thing
that
one
of
the
recent
set
of
models
that
people
are
using
our
transformer
attention
models,
and
this
is
based
on
so
we
now
have
this
conceptual
view
that
everything
we're
building
all
of
the
data
sets
were
going
after
or
have
hierarchy
hierarchical
structure
in
them.
B
One
way
to
do
that
is
actually
human
memory
is
structured
as
a
highly
associative
memory,
where
I
have
some
concept
in
my
head
and
it
Prime's
me
to
think
of
some
other
concept.
That's
highly
related,
so
I'd
like
to
put
in
associative
structures
in
my
models,
and
so
things
like
attention
mechanisms
can
do
that
and
in
particular,
hash
maps
are
or
key
value
stores
are
associative
memories
that
look
like
that
and
they
can
be
used
to
cache
these
models,
and
so
I
can
use
attention
mechanisms
as
a
probabilistic
associative
memory.
B
That
makes
sense,
and
so
I
want
to
make
sort
of
a
bold
claim
here,
but
and
and
take
this
with
a
grain
of
salt.
If,
if
you
are
a
data
data
structures,
expert,
there's,
probably
a
need
for
your
data
structures
in
probabilistic
models
like
deep
learning,
and
so
if
you--if
there's
a
data
structure
that
you
think
would
help
transforming
from
one
representation
in
your
model
to
the
next
one.
If
you
can
come
up
with
a
probabilistic
version
of
that
data
structure
that
can
be
used
in
your
model
and
it
might
actually
help
with
training.
B
I'm
going
to
touch
on
this
just
briefly:
inductive
bias
effects
the
these
learning
curves.
That
I
was
talking
about
previously,
and
so
we
tested
a
handful
of
different
models
to
look
at.
How
does
how
do
different
models
with
their
different
inductive
biases?
How
does
that
affect
their
learning
curve?
So
maybe
this
one
would
be
the
interesting
one
to
talk
about
in
speech
recognition.
B
We
train
models
called
deep
speech
to
which
is
a
recurrent
model
at
from
baidu,
and
then
we
also
put
together
an
attention
model
also
recurrent,
but
it
uses
attention
to
attend
to
different
positions
in
the
in
the
sequence
and
we
looked
at
the
learning
curves.
So
this
again
shows
this
power
law
characteristic
speech.
Recognition
happens
to
be
one
with
a
fairly
good
exponent
here
word:
language
modeling
is
really
bad.
This
one's
hard.
B
You
have
to
use
a
lot
of
data
in
the
case
of
speech
recognition,
if
you
talk
to
machine
learning,
researchers
that
do
speech,
recognition,
work
and
they
have
experience
with
deep
speech,
and
you
ask
them
how
our
attention
models.
How
well
do
they
do
every
one
of
them
will
complain
about
how
hard
it
is
to
train
them.
B
It
requires
more
data,
so,
in
practice,
what
we
did
is
we
actually
just
use
a
different
model,
a
language
model
to
help
the
speech
recognition,
attention
model,
formulate
language,
the
language
model.
It
gives
it
some
predictions
about
what
words
should
come
next
and
if
you
actually
plotted
this,
we
took
out
the
the
language
model.
If
you
put
the
language
model
back
in
you
see
this
come
down
and
it
does
better
than
the
deep
speech
to
model.
Mustafa talked about data augmentation, but maybe it's worth noting that data augmentation is another form of inductive bias that you're trying to impose on the model you're training. If we think that our model should have invariances like cropping, translating, and rotating, and we're building components of the model that should be able to recognize that, it would be nice to have samples in my dataset that encourage it to learn that, and that can be helpful for extracting out these hierarchies.
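A small sketch of encoding those invariances as augmentation in the input pipeline, using torchvision transforms; the specific crop size, flip probability, and rotation range are placeholders you would tune for your data.

```python
from torchvision import transforms

# Each transform encodes an invariance we want the model to learn:
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),        # translation / scale invariance
    transforms.RandomHorizontalFlip(p=0.5),   # left-right symmetry
    transforms.RandomRotation(degrees=10),    # small rotation invariance
    transforms.ToTensor(),
])
# Applied on the fly in the data pipeline, e.g. ImageFolder(root, transform=augment).
```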
B
One
other
one:
that's
that's
really
important
and
I
want
to
show
practically
speaking.
What
what
happens
in
this
scenario
is
how
you
initialize
models
when
you
so
initialization
is,
is
a
bias
on
what
you
think
the
models
prior
should
be
so
prior
probability
is:
if
I
was
just
predicting
the
outputs
and
I
just
had
the
output
data
set.
How
likely
are
these
things?
That's
the
the
probability
I'm
going
to
choose
is
the
you
know
how
frequent
the
one
of
the
outputs
is
seem.
That kind of gives me this characteristic that most of the embeddings are going to be orthogonal. I say most because in this case my vocabulary size is very large relative to the hidden dimension, and I can only have 256 orthogonal vectors in a latent space that has 256 dimensions. So when I uniformly randomly initialize, I get a lot of orthogonality, which is good, but my model now is going to have spurious relationships between these words.
B
Some
of
the
vectors
are
actually
going
to
have
a
nonzero
cosign
between
them
and
so
the
model
when
I'm
training
it
actually
kind
of
needs
to
unlearn
these
problems
these
these
relationships.
So
it's
worth
kind
of
thinking
through
this
in
in
examples
like
this,
when
you're
doing
initialization.
What
is
it
that
this
sort
of
initialization
causes,
and
is
this
kind
of
the
prior?
That
I
think
is
important
at
this
part
of
the
network
and
then
in
language
modeling.
We
do
something
that
Mustafa
was
mentioning
and
transfer
learning.
B
We
we
initialize
with
pre
trained
word
embeddings,
because
this
already
tells
us
that
you
know
king
and
queen
are
related
through
a
gender
relationship,
just
really
nice.
So
if
you
initialize
with
pre
trained
vectors
training
is
faster
you
you
could
train
the
embeddings
on
a
much
larger
data
set
than
than
the
thing
you're
training
on
later
and
so
training
the
full
model.
You
might
not
be
able
to
learn
all
of
those
embeddings
very
accurately.
That's
a
that's!
A
phenomenally,
interesting
research
direction.
B
There
are
people
are
that
are
continually
working
on
embeddings
right
now
and
specifically
for
large
vocabularies,
but
yeah.
So
the
I
mean
kind
of
a
rule
of
thumb
in
the
in
the
community.
Right
now
is
you're.
Embedding
dimension
in
language
modeling
you're,
embedding
dimension
could
be
like
roughly
the
square
root
of
your
vocabulary,
size
and
the
maybe
the
intuition
there
is
in
language,
modeling
and
using
multiple
vectors
to
represent
things,
and
often
there
are
combinations
or
sets
of
words.
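A one-line sketch of that rule of thumb, nothing more than the square-root heuristic he states:

```python
import math

def embedding_dim_for_vocab(vocab_size):
    # Rule of thumb from the talk: embedding dimension ~ sqrt(vocabulary size).
    return int(round(math.sqrt(vocab_size)))

print(embedding_dim_for_vocab(10_000))   # ~100
print(embedding_dim_for_vocab(250_000))  # ~500
```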
All right, so are there any questions on getting at what the data is telling us and seeing how we design our models for the data? All right, so I'll go through these things quickly. In the large research study that we ran, the trickiest thing was this: it was easy to scale back our datasets and run on small training sets; it was much harder to scale that back up and get to scale. I'm going to let future speakers cover some of this material, I think.
B
But
there
are
a
couple
things
practical
challenges
that
you
it's
useful
to
be
aware
of,
because
it's
nice
to
train
on
small
data
sets
okay,
so
as
an
example
for
image
classification,
when
we
were
training
our
models
for
this
research
study,
we
started
with
a
model
and
a
so
we
started
with
a
code
base
data
set
and
a
pre
trained
model
which
could
give
us
accuracy
that
was
state
of
the
art
at
the
time.
So
this
is
a
learning
curve.
Again,
this
is
the
dotted
line.
B
Here
is
the
learning
curve
I'm
not
sure
if
I
showed
the
learning
curve
for
image
classification,
but
this
is
what
it
looks
like
here's,
the
small
data
region,
here's
the
power
law
region,
the
small
data
sets
are
sorry.
The
large
data
set
the
large
model
we
had
trained
correctly
to
the
right
accuracy,
and
now
we
wanted
to
start
kind
of
backing
this
off
ablating
the
model,
reducing
the
data
set
size
and
see
if
we
could
to
sketch
out
this
curve,
and
the
reason
that
you'd
want
to
do
this
in
practice.
The reason that you'd want to do this in practice, for a lot of us in here, is that I'd prefer to pick a dataset size toward the beginning of this power-law region and do all my iteration there, because with a small dataset and small model sizes I can run lots of training runs very quickly down here. We would really like to be working in this region, and then at a later time, when everything works here, run down this curve, right?
B
So
as
we
were
kind
of
filling
in
points
going
backwards
here,
we
actually
saw
a
curve
that
looked
like
this,
which
was
very
problematic
this
this
guy,
so
that
the
procedure
that
we
were
using
was
decrease.
The
data
set
size
in
steps
decrease,
the
model
sizes
were
going
to
smaller
sizes
and
then
I
think
we
were
being
conservative
here
and
keep
everything
else
fixed,
and
so
we
got
this
thing
that
we
we
termed
as
the
small
data
gap
I
haven't
I
haven't
written
about
this
publicly.
Okay, so neither did we, so fair point. It turned out the batch sizes were too large for these small models. If you think about it, a smaller model has a smaller latent dimension; I have fewer parameters to represent a transformation that I'm making at any point in my optimization process, and what that leads to is that smaller models have lumpier error surfaces.
B
I
have
to
sort
of
compete
with
the
things
that
I've
already
learned,
and
so
what
I
mean
by
a
lumpy
error
surface
is:
if
I
continue
training
in
one
direction.
On
my
air
surface,
it's
likely
I
will
run
into
non
convexity.
My
error
will
get
worse,
and
this
is
what
exactly
what
was
happening
and
especially
when
we
were
using
batch
sizes
that
were
too
large
batch
gradient
descent.
B
So
because
we're
averaging
over
a
bunch
of
of
samples
with
a
large
batch,
the
the
averaged
gradient,
is
something
that's
not
actually
a
gradient
from
any
one
of
the
samples
and
so
going
in
that
direction
might
do
more
harm
than
good.
It
might
cause
the
model
to
unlearn
things
that
it
had
previously
learned.
B
So
there's
some
work
on
this,
and
maybe
Thorsten
can
cover
this
a
little
bit
more
detail,
but
there's
a
paper
that
describes
that
batch
size
can
grow
linearly
roughly
linearly
with
the
data
set
size
under
a
handful
of
assumptions
about
how
your
data
is
distributed.
This
turned
out
to
be
very
handy,
and
we
empirically
vetted
this
result
when
we
were
doing
our
study
that
it
was
it
worked
best
if
we
decreased
the
batch
size
linearly
when
we
started
seeing
something
like
this.
So
starting
at
this
point
decrease
the
batch
size
linearly.
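A minimal sketch of that heuristic: scale the batch size linearly from a reference point that worked, and clamp it to something sane. The reference values and clamps below are placeholders, not numbers from the study.

```python
def scaled_batch_size(ref_data_size, ref_batch_size, data_size, min_batch=16, max_batch=4096):
    """Batch size grows/shrinks roughly linearly with dataset size (per the paper he cites)."""
    batch = int(ref_batch_size * data_size / ref_data_size)
    return max(min_batch, min(batch, max_batch))

# If batch 1024 worked at 1M samples, use ~64 when training on a 1/16th-size shard.
print(scaled_batch_size(ref_data_size=1_000_000, ref_batch_size=1024, data_size=62_500))
```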
B
Because
of
this,
I
can
start
backing
out
a
whole
bunch
of
different
things.
I
can
I
can
make
a
prediction
given
an
accuracy
level.
I
can
predict
my
data
cost.
So
if
I'm
added
particularly
particular
accuracy
level
now
and
I'm
going
to
scale
my
data
set
size
and
it
costs
some
amount
per
sample,
I
can
estimate
how
much
it
would
cost
to
grow.
My
data
set
size
public
data
sets
are
very
helpful
and
there
are
ways
to
maybe
get
through
this
without
having
so
like.
B
Dataset
size
grows,
a
certain
amount
model
size
grows,
a
certain
amount.
My
computer
operations,
compute
operations
per
parameter
in
models
is
roughly
linear
and
the
number
of
training
steps
that
I
need
to
Train
on
a
larger
data
set
grows
roughly
linearly
in
the
data
set
size,
and
so
now
I
can
estimate
that
the
compute
time
that
I
have
grows
a
little
bit
less
than
the
square
of
the
data
set
size.
Growth
to
larger
sets.
Okay
and
I.
We
have
a
paper
on
this
also.
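Putting those rough scaling rules together in one hedged back-of-the-envelope estimate (model size grows sub-linearly with data, training steps grow roughly linearly, and compute is roughly model size times steps); the exponent here is a placeholder, and exponents closer to 1 push the result closer to the square of the data growth.

```python
def relative_compute(data_growth, model_exponent=0.57):
    """Compute ~ (model size) x (training steps):
    model size ~ data^model_exponent, steps ~ data, so compute ~ data^(1 + model_exponent)."""
    return data_growth ** (1.0 + model_exponent)

# Growing the dataset 10x with a sub-linear model-size exponent:
print(relative_compute(10))  # roughly 37x compute, less than the square (100x)
```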
And I will skip ahead; I'll call it here and jump to my summary.