From YouTube: 01 - Introduction to Machine Learning - Brenda Ng
Description
Deep Learning for Science School 2019 - Lawrence Berkeley National Lab
Agenda and talk slides are available at: https://dl4sci-school.lbl.gov/agenda
A: Good morning, all. Welcome to Berkeley Lab, welcome to the Deep Learning for Science Summer School. We are really glad to have you here; thanks for waking up bright and early on a Monday morning. So, you know, I did want to acknowledge early on that deep learning, of course, has been taking off in recent years, and over the last five years we've really seen deep learning for science take off. The reason that we have 150 of you here, pretty much the room's capacity, is because we all feel there's a lot of promise in applying deep learning techniques to the scientific process. You can read a lot of generic introductions to deep learning and machine learning on the web, but there really is no definitive resource you can turn to for deep-learning-for-science material.

A: So this is really the brainchild of one of our own. I think about a year ago, he felt that there was a need, a gap in the community: to create a targeted event where we could really go into depth on what it would take to get deep learning to work for scientific applications. And that's the reason why this summer school exists. So today we're going to kick things off with Brenda Ng. So Brenda is...
B: So these are the learning objectives in particular. Hopefully, at the end of my talk, you guys, you guys and ladies, will be able to answer these questions: What is machine learning? What's the relationship between deep learning and machine learning and AI, and all that good stuff. You guys, can we... okay?
B: And so with these definitions, again, these days we're still kind of in the early days of deep learning, but already these definitions firm up what researchers like yourselves are going to research, and perhaps aim your curiosity towards. And so let me get into the relationship between AI, ML, and DL, because sometimes there's a lot of confusion. It's actually...
B: First, we have computer science, and within the field of computer science we have artificial intelligence. Essentially, artificial intelligence is the engineering of intelligent machines that can think kind of like humans, and it has its roots back in the 1950s. Back in those days, I still remember, there was propositional logic and all that stuff, but nonetheless there still needed to be some knowledge rules encoded in those propositions. And so how about machine learning, in like the 1980s?
B: Is it possible, just by giving the machine examples alone, for the machine to extract knowledge without explicitly programming such rules? And so deep learning is yet another subset of machine learning: it is machine learning, but it is using neural networks as the vehicle for the mathematical models with which we do this machine learning. And so it has really taken off since the nineteen twenties... and even now.
B: Oh sorry, 2010; it has really taken off since 2010, and it's really proliferating even right now. So I want to give you guys a perspective from a layman's point of view. I gave you these sets of Venn diagrams, okay, but I want to motivate what artificial intelligence is really like. Well, this whole progression from artificial intelligence and machine learning to deep learning is really driven by our human...
B: Laziness, if you will. So artificial intelligence is essentially this: perhaps you have a job that's super tedious and you really don't want to do it, but it's still not so easy that you can just, you know, get a robot to do it, and it requires some troubleshooting and whatnot. So is there a way that perhaps you can, you know, write a script to do it? Artificial intelligence pretty much is motivated by the fact that you don't want to do it; you want to train a machine to do it. Now...
B: Machine learning is the motivation of: okay, granted, I don't want to do it, I want to train someone else, this machine, to do it. But is it possible that, instead of writing down these rules or these conditions for how I want this task to be done, I could just give it a whole bunch of examples, whereby the examples are processed in a way that highlights the important features of the problem? And so that's machine learning.
B: Now deep learning is like yet another level of laziness, where it's like: oh, I don't even want to do any feature engineering; I just don't know how to, or it's just too annoying. So is it possible for me to give tons of examples to a machine for it to just learn the important features by itself? So essentially it's a progression of laziness, but nonetheless it's productive laziness, because here we are in this revolution of deep learning.
B: So there's some history, again, to motivate this: artificial intelligence is really from the fifties, and then machine learning took off, but deep learning is a relatively new field. And so one should not use deep learning and AI in a synonymous sense. It's true that deep learning is the ML algorithm du jour, but it doesn't supplant all of artificial intelligence and the other machine learning algorithms.
B: So now that I've given you a really quick overview of the relationship between AI, ML, and DL, I'm going to get into the workflow, and this workflow is going to be a bit more detailed than the one you're probably used to, because I want to give you a sense of how you would actually do it if you were to go home tonight super motivated to train an ML model.
B: Okay, so generally, in a problem we have inputs, and these inputs might be, you know, from our experiments. So we have some inputs, or knobs, that we can turn, and then, when we run our experiments, generally we get some kind of target. And so that's x and y, and so I'm going to talk about a machine learning workflow in a supervised learning sense, where from an input...

B: ...we do get the labels, or targets, which is the y. Now, in the olden days before deep learning, we had to do feature engineering, and the reason is, for example: if you were taking pictures of everybody here and we wanted to train a face recognition algorithm, we'd probably have to engineer some features, like maybe the shape of the eyes or the width of the nose and others. So that's like before deep learning: someone with a lot of subject matter expertise has to use that knowledge and encode it into this f function.
B: Now, before we train, generally we have to split our data into three partitions, and partitions mean, essentially, three disjoint sets. Generally I do a rule of 80/10/10, meaning: if I have my data, I would generally partition 80% of it into the training data, 10% into the validation set, and 10% into the test set. And think of it like this...
B: Training data is like when you're studying for an exam, like a calculus exam: training data is when you're reading through your notes, your book, and you're looking through the worked-out examples, because immediately, as you see the question, the x, you also see the y, in the same place. So think of it as the worked-out examples, just for your studying, and that's for training of the model. But validation data...
B: It's kind of like the examples that you're supposed to try out, and you should try them out first, before looking at the answer key, because that is to test how well you actually know the material, from the information gleaned from the training data. And then test data, clearly, is like the examples that you would get in an exam: you do it, and that's super important, because you get graded on it. But generally you may not...
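The 80/10/10 split described above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the talk; the helper name `split_80_10_10` and the toy data are invented for the example.

```python
import random

def split_80_10_10(pairs, seed=0):
    """Shuffle (input, target) pairs, then partition into three
    disjoint sets: 80% train, 10% validation, 10% test."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # shuffle so splits share the same distribution
    n_train = int(0.8 * len(pairs))
    n_val = int(0.1 * len(pairs))
    train = pairs[:n_train]
    val = pairs[n_train:n_train + n_val]
    test = pairs[n_train + n_val:]
    return train, val, test

data = [(x, 2 * x) for x in range(100)]  # toy (x, y) pairs
train, val, test = split_80_10_10(data)
print(len(train), len(val), len(test))  # 80 10 10
```

Shuffling before splitting matters for the point Brenda makes later: all three sets should follow the same distribution, which a random shuffle of a homogeneous data set gives you.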
B: So I'm going to take my training data and I'm going to train my algorithm, and generally, before I do that, I need to decide what kind of algorithm, or family of algorithms, to use. So when I say an ML algorithm, the M... it's really just a mapping, and it's not any scarier than a mathematical mapping from inputs and parameters to a predicted output, such that, you hope, the predicted output matches the true output once you've tuned your parameters well enough.
B: Essentially, we are exposing the model to all the instances in our training set, which is here, but all the while we are trying to tune our parameters based on that data, and that's why I've highlighted it as well. And generally, how do we tune these parameters? Well, we tune them based on some loss function that we pre-specified.
B: So, for example, if you're predicting housing prices, a viable loss function might be the MSE. But if, for example, what you're predicting is a class, like, is it a dog, is it a cat, and so on and so forth, then you might want to use something that can handle categorical items, which is the cross-entropy loss. So there are different types of loss functions, but generally, based on what you know about your problem, you have to specify that a priori. Nonetheless, these two are the most popular ones.
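The two losses mentioned here can be written out directly. A minimal sketch (function names and numbers are illustrative): MSE for continuous targets, and cross-entropy as the negative log-probability assigned to the true class.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error, for continuous targets like housing prices."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_entropy(y_true, probs):
    """Cross-entropy for class labels: average of -log(probability
    the model assigned to the true class)."""
    return -sum(math.log(p[t]) for t, p in zip(y_true, probs)) / len(y_true)

print(mse([200.0, 310.0], [210.0, 300.0]))  # 100.0
# two examples: true classes 0 and 1, with predicted class probabilities
print(cross_entropy([0, 1], [[0.9, 0.1], [0.2, 0.8]]))
```

Both losses go to zero as the predictions approach the targets, which is what makes them usable as the training signal in the workflow described above.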
B: So generally, because we have more than one training instance in our training set, we would iterate, and because this is a deep learning summer school, we know that generally, when we train a neural network, we use an iterative process called stochastic gradient descent, or mini-batch versions of it. So essentially, that's the picture: at every iteration, imagine that you have this unknown loss function that you don't quite see, that slopes down, but nonetheless you're able to evaluate it at each point given the theta. So at each iteration...
B: You have your parameters, the theta, and you have your training data, and so you apply those to estimate your target, and from the target you can compute your loss. And so that's what these balls represent: each one represents the loss of a specific instance. And because we use gradient descent...
B: This is the update rule that we see: we have a theta, which is the current parameter, and it's being tuned by some learning rate times the gradient. And we see that when the surface is sloping down, this naturally guides your parameters toward the optimal point, which is at the bottom of this surface.
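The update rule theta ← theta − learning_rate × gradient can be demonstrated on the smallest possible model. This sketch (all names and data invented for illustration) fits a one-parameter linear model y ≈ theta·x by full-batch gradient descent on the MSE loss:

```python
def train_linear(data, lr=0.01, epochs=100):
    """Fit y ~ theta * x with plain gradient descent on the MSE loss.
    Each step applies the update rule: theta <- theta - lr * dLoss/dTheta."""
    theta = 0.0
    for _ in range(epochs):
        # analytic gradient of the mean squared error w.r.t. theta
        grad = sum(2 * (theta * x - y) * x for x, y in data) / len(data)
        theta -= lr * grad  # the update rule from the slide
    return theta

data = [(x, 3.0 * x) for x in range(1, 6)]  # toy data with true slope 3
theta = train_linear(data)
print(round(theta, 3))  # converges to about 3.0
```

Stochastic or mini-batch gradient descent differs only in computing `grad` from one example or a small batch per step instead of the whole set.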
B: So, even though this is an introduction, for those of you who are doing deep learning: oftentimes there are optimizers that do adjustable learning rates. So generally, if you are doing deep learning, you don't really have to worry about the learning rate, because you can use Adam or other types of more advanced optimizers that can tune it for you.
B: But the idea is that, given your training set, you expose the instances of your training set to your algorithm, and that's called a learning epoch. And say you expose the training data multiple times, like multiple passes over this training data: you iterate over them and you tune your theta, with the hope that you're probably pretty close to the optimum, because you've been tracking your loss and it's going down. Then, let's say, we freeze it.
B: How do we know how good the model is? And again, the model is just a mathematical function that takes x and the theta, which you're now freezing, to predict your y. And so that's where the... oh, before you get there: when we ask how good a model is, immediately concepts such as underfitting and overfitting are relevant. So the idea is that M is a function, a mapping from the inputs to the predicted targets.
B: Clearly we don't want something that is underfitting like this: it's clearly not fitting the training examples, so this one clearly needs to go back to the training, the training dojo, I guess, to get trained a bit more. But then here, it's trained so much that it's just memorizing all of the training examples, such that if I were to give it a new example, something that is not in the training set, like this red point...

B: ...it would not fit. And so how do we know whether I'm going to get a good fit? That's where our validation data set comes in. And so you might be thinking, well, yeah, finally, okay: what do we do with the validation data? The validation data is what we use to compute the loss again. So recall that I froze the parameters, because I've already done a lot of training.
B: I've exposed the algorithm to the instances of the training data to the point where I think I'm pretty happy with my theta, so I froze that, and that's what's being passed here. So now, with that model frozen, the parameters frozen, I'm going to use it to evaluate my validation data. And what I would do, generally, even when I do this back in the office: it's super crucial for us to plot the losses, because they give you a sense of what is going on in your training.
B: But we know that generally, again for deep learning or other methods, if we're doing an iterative type of optimization algorithm, we need to iterate this a number of times. So previously I was iterating over the instances in the training data, and now I'm doing this whole thing in an iterative manner, and every time I iterate, think of it as: I am improving my theta. So each of the x's here is one theta, and as I'm training, hopefully my theta...
B: If my training is running right, it should do better. So that's why we see that, at least for the training data, the loss is going down, and it should go down, because otherwise there's something wrong with your code. But for the validation data, you see that at some point it starts to go up, and you might be wondering: why is that? Well, the reason is, when you train too hard... oh, okay. So, ideally, at the optimum, you want to stop training.
B: You want to stop training when the validation loss is at this bottom, and the reason is: generally, when you split into training data, validation data, and test data (I mentioned that you'd split into an 80/10/10 kind of percentage), usually when you split these data sets you also want to make sure that they have the same distribution. Meaning, again using the calculus example: if you have been studying derivatives, that's your training data, and suddenly your practice...

B: ...exam is integrals and stuff. If you'd learned integrals, you would do well; but imagine you guys hadn't learned integrals yet: you would do super horribly on it. So it is very crucial for us to have these three splits follow the same kind of distribution, maybe even have the same support. Okay, so going back: this is the optimum point, because beyond that point we see that the validation loss is going up.
B: However, imagine if you had stopped before the optimum point: then, in a way, you still have room to improve on your validation loss, you see that? So that's what's called underfitting, and this picture, you guys remember this picture from like two slides ago: usually underfitting corresponds to this scenario again.
B: But then, if you train for many epochs and aren't really watching what's going on, as mentioned, you might kind of be over-tiring yourself with your studying, and in machine learning terms, pretty much, you are starting to memorize all your training examples, so that you're not really generalizing anymore. And that's why you see this trend: past a certain point the generalization error, sorry, the validation error, would go up, and you really want to stop before it does. And that's the case.
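"Stop when the validation loss is at its bottom" is the early-stopping rule, and it can be sketched in a few lines. This is an illustrative helper (the name `early_stop`, the patience rule, and the toy loss curve are all invented for the example, not from the talk):

```python
def early_stop(val_losses, patience=2):
    """Return the epoch with the lowest validation loss seen so far,
    giving up after `patience` epochs without improvement."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss has turned up: overfitting has begun
    return best_epoch

# validation loss dips, then climbs as the model starts to memorize
val_losses = [0.9, 0.6, 0.4, 0.35, 0.4, 0.5, 0.7]
print(early_stop(val_losses))  # 3, the bottom of the curve
```

Deep learning frameworks ship equivalents of this (e.g. an early-stopping callback), typically also restoring the parameters from the best epoch.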
B: If we're happy with the model in terms of its performance, then we have a model that's ready to go, that could be deployed on whatever problem you guys might have. But if it's really not so great, then you kind of have to go back and troubleshoot, and I'm not going to sugarcoat it: machine learning types of troubleshooting can be very frustrating. Sometimes, even working with TensorFlow code (can I... okay, sorry, Google people in the room) or any other deep learning library, sometimes they might change...
B: ...like the order of arguments. So, I don't know, things can be very subtle, such that you may not notice unless you really check the sizes of all your tensors. But I'm going too far into implementation. In general, things that you can check are, like: well, do you have sufficient data? Generally, if you... let me go back to this chart.
B: What you would generally want to do is make sure that your training loss gets as close to zero as possible when it converges, and if it doesn't, that tells you that maybe the model doesn't have enough parameters, like expressive power, to solve the problem, or that you may not have enough data. And so those are some of the considerations there. Or remember...
B: This is not deep learning (even though I've been mixing deep learning into this talk); we are using, you know, old-school features, so maybe they're the wrong features for this problem. And so there could be a whole bunch of issues that you might want to consider if your test performance isn't up to what you expect. So now, hopefully, you guys have a good appreciation of what it takes to train an ML algorithm. I'm going to get into deep learning, and then later I'm going to contrast...
B: ...how deep learning is different from traditional machine learning. Oh, I guess I'm doing it out of order, okay. So traditional machine learning, as I mentioned, has this really tedious aspect of feature extraction, and generally it requires expert knowledge about the problem in order to extract the right features, yeah.
B: So, for example, if you are a real estate agent and you're trying to predict housing prices: I mean, if you're a real estate agent, you know whether you're in a good school district, or whether you're too close to the highway, and, I don't know, houses too close to the highway tend to be cheaper than the ones that are in, like, a cul-de-sac. Okay, but yeah.
B: But what I'm saying is that, generally, if you are to do this kind of traditional machine learning, you really need to understand the problem at hand and use that knowledge to craft your features. And then you can put it into, you know, your favorite classification algorithm, and then hopefully it will give you the desired output, which, in this instance, is: it is...

B: ...it is a car. But with deep learning, it's kind of an end-to-end situation, and so you see how this poor person, an Amazon Turk worker maybe, is no longer needed, because we can just dump raw images into the neural network, and as part of its training it's able to learn hierarchical features. And so we don't have to do manual feature extraction anymore; we can just do this end to end. And that's why people really like deep learning: because, as I mentioned, we have better things to do in life than craft features or design features. So what is deep learning? Let's get back to basics. The basics of deep learning: we start with the artificial neuron.
B: Actually, deep learning comes from neural networks, and neural networks are composed of these artificial neurons. And really, these neurons are not simulating how the wonderful neurons in our brains work; they are more like inspired by them. Each is simply a mathematical model inspired by the fact that, just like a regular biological neuron, we have synapses that take signals from neighboring neurons; those come in and interact with this neuron's cell body, and then out comes an output signal, which is then the input to the next neuron down the chain.
B: So if you look at this (and most of you guys have probably seen this before), your input is something that comes from the previous layer, the previous neuron, and then we have these parameters, these w's, that you multiply with your input, and then we also add a bias term, which is the b.
B: And then we pass this sum through a nonlinearity f, and this altogether is what's being passed out as the output signal. And so, if f is just a linear function, then all we can learn are linear models. But generally we choose f to be nonlinear, because chaining a whole bunch of nonlinear functions is what really gives deep neural networks such expressive power. So, yeah: the w's and b's are the parameters.
B: So previously I made a big deal about the thetas; in neural networks, the W's and the b's are your parameters, and those are the ones that you have to tune by exposing your deep learning model to data. So a neural network is really just neurons, but, you know, arranged in a graph. And so, where I showed you one neuron...
B: Imagine now you just have a whole bunch of them, and they're connected in this kind of graphical form. Generally, the input to your problem constitutes the input layer, and then it depends on how many hidden layers you want to put in your model; again, the more hidden layers you have, the more chaining of nonlinearities you're adding, and that will increase the complexity of your model. And generally, your output layer is also dictated by the inference task at hand.
B: So if it's a multi-class classification, then you might have, you know, multiple neurons, corresponding to the number of classes; or if it's just regression, then you might just have one number, because that's what you're predicting. So let's dig deep and look a little bit at what the math is all about. Let's just focus on the input layer: I've been using x as my input, so that's fine.
B: You still have the x, but then, as I propagate x through this model's first layer, I'm multiplying it with the W. So the W1 and b1 are the parameters specific to this first layer, okay? So this is just what I showed you with the neuron cell, but maybe I've put it in kind of matrix notation; this shouldn't come as a surprise.
B: Now, as we propagate the signal from the first layer to the second layer, we see that we've added yet more of the magenta math to it. So the second layer is now transforming this output signal by multiplying it with yet another weight that's specific to this second layer, adding the second-layer-specific bias, and then putting it through the nonlinearity. So I apologize that I'm a little bit lazy here.
B: I should say that you can choose different nonlinearities specific to each layer, but here I've just put f in general; you don't have to, you can have different nonlinearities. And now, what about our last layer, the output layer? Generally, for the output layer, if it is regression, we generally just keep it as a linear thing, without the nonlinearity. And so, altogether...
B: This chaining of mathematical transformations across layers is what constitutes your model, and so that's the same model that was in the workflow earlier, and now the theta are the things highlighted in yellow. So when you're going through your machine learning workflow and trying to train your model, you are actually tuning all these W's and all these b's.
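The layer-by-layer math above can be written out for a tiny two-layer network. This is an illustrative sketch, not code from the talk: tanh stands in for the hidden-layer nonlinearity f, the output layer is kept linear as described for regression, and the weights are made up.

```python
import math

def matvec(W, x, b):
    """Compute W x + b, with W as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def forward(x, W1, b1, W2, b2):
    """Two-layer network: h = f(W1 x + b1) with f = tanh,
    then a linear output layer y = W2 h + b2 (typical for regression)."""
    h = [math.tanh(z) for z in matvec(W1, x, b1)]  # hidden layer
    return matvec(W2, h, b2)                       # output layer, no nonlinearity

# made-up parameters: 2 inputs -> 2 hidden units -> 1 output
W1, b1 = [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]
W2, b2 = [[1.0, 1.0]], [0.1]
print(forward([2.0, 1.0], W1, b1, W2, b2))
```

All of W1, b1, W2, b2 together are the theta that training tunes; adding a layer just chains one more `matvec` plus nonlinearity.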
B: So what's the difference between just a vanilla neural network and a deep learning neural network? It really is just the fact that you have more layers; that's the "deep". And so, back in the 80s, pretty much, they didn't really have the data nor the hardware to achieve the kind of massive models that we have now. And what's really nice about the kind of neural networks that we can train now, being so massive, is that we can kind of peel back the layers and examine what the model is really learning.
B: This is how it's doing it, essentially for free: it's built into the graphical structure that we can peel away these layers and be able to see what features are relevant. And so sometimes, when we do deep learning, it's not just about training a neural network to predict something. Sometimes, as we will see later, we might want to artificially pose a problem to the neural network, to trick it into learning something cool in here, so that we can then take those features and do something else with them.
B: So just keep that in mind. And so you guys might think, well, what happened? Well, for those of you who were alive then, I guess: what happened between the 80s and now?
B: Well, it's really the confluence of three things. So, long ago, in the beginning, in like the 50s, essentially they could only train really small models, because first they were hardware-limited, and they were also really data-limited. And then in the 80s they were able to figure out, you know, more tricks in order to develop the early neural networks. But the poor researchers were still stuck with really limited hardware; but at least, hey, they got this. But now...
B: We have ImageNet and other big datasets; we have so much data. And for those of you who, you know, take pictures and post them and are super active on social media: you are contributing data every day. So essentially, on data, we have a really good handle. And also, with the investments of Nvidia and other hardware companies really investing in their hardware infrastructure...
B: It's really the confluence of everything, including super-smart researchers who are figuring out new ways to train things deeper and better, with residual layers and things like that. It's a confluence of the smarts, the hardware, and the data that's really letting us overcome those sad, sad AI winters and get to the explosive growth right now. And, you know, explosive growth: I can't leave this without just pitching again that now deep learning is truly everywhere.
B: It's in image classification, as you can tell; so, yeah, when you upload your pictures to any kind of cloud service, oftentimes they immediately categorize your faces. That's all deep learning. And it's also starting to play a really big part in medicine and biology as well, such as: when people have diabetes, they often can go blind because of some kind of diabetic...

B: ...retinopathy (I'm not pronouncing this right). But we are also involved in some healthcare projects where we're trying to help build deep learning models to help diagnose illnesses based on multimodal data, such as radiology reports as well as images, and so on and so forth. Essentially, you are pretty much touched by deep learning everywhere: if you have a phone, it is touching you right now, yeah.
B: So that brings us to the end of a really quick intro to deep learning. Now I'm going to get into the three main branches of machine learning, and again, I promise you, I will not get bogged down spending too much time on the super-classical methods. The reason is, if we were to do this lecture right...
B: ...if it's categorical, that's classification; or if we are predicting a real-valued number, that's regression. Now, unsupervised learning is kind of like things like clustering and dimension reduction, where you really don't get the target: you don't get the y's, you only get the x. Sometimes it's just too expensive to gather the y, so you just want to see what you can do with the x. And in general, in unsupervised learning...
B: The goal really is to uncover structure in this unlabeled data, and so, naturally, things like clustering and dimension reduction seem to be the kinds of approaches one would take for unsupervised learning. Now, reinforcement learning has been getting a lot of attention, because of AlphaGo and all the other cool things, and even autonomous cars. In a nutshell, it is learning actions based on feedback from the environment.
B: Even though most of the time, you know, we're kind of focused on supervised learning, all the other areas play out in everyday life as well. And I was going to say that, even though these branches have technical names, even in our everyday life these types of learning are not too far removed from the way we think. So, for example, imagine you are learning how to drive.
B: You don't know how to drive yet, but you're learning how to drive, and so you're taking your driving school, or maybe your parents are watching you and teaching you. So immediately, that's supervised learning, because if you make a turn too close to the curb, someone's going to, like, I don't know, step on the brake or do something; so, very supervised learning. But then, say, once you're comfortable enough to drive on your own, as you're driving, of course, you are taking actions.
B: You are reacting to the environment. Imagine you have to make a left turn, one of those left turns with no, like, stop sign; you just have to be aggressive. So say the first time, when you had to get to work, you never got to work on time, because you were stuck there for like 15 minutes; but then the next day you're going to be better, and so on and so forth. So, learning from experience: that's reinforcement learning. And unsupervised learning is sometimes, say, when you're driving in a different city.
B: So recently I went to Rome, and those people, they just drive; they don't stop. So, in a way, you can partition, you know, people's internal driving behaviors, so that you can react accordingly. So, pretty much, all three of these branches, even though they're in machine learning, really have, you know, analogs in our everyday life, so they're not that foreign. So let's talk a bit more deeply about how they're different. As we know, in supervised learning...
B: We have our targets, that's the label, right? And generally, once we train our neural network or machine learning model, we would then compare against it. Sorry, so this is what we are predicting, and we're comparing it against the ground truth. And generally, you know, if it's continuous, remember, we use some kind of root mean square, and if it's categorical, we use some kind of cross-entropy. And so that's the loss that is then used as a signal to tell us how well we are doing.
B: How good is my model, and, since it depends on the theta, whether I still need to tune my theta. So that's supervised learning, and we know it gives immediate feedback, because if you're not doing well, immediately you know you're not doing well; you can use that loss signal to tune your parameters. But reinforcement learning, on the other hand, is a little bit different.
B: It's kind of like, well, you have what are called delayed rewards, because sometimes, say you are playing a video game and you are moving your joystick, or, if it's Xbox, you do those things like that, and you may not know that, oh, you should have done something else, until you might die, maybe five minutes later, because you'd used up your ammo or something. So, in a way, the reward is not immediate.
B
It's more like: you took an action, it influenced the state of the world, and that in turn gives you a reward, which you can use as your signal for how well you're doing. Now, unsupervised learning gives you no feedback; it's more like exploring. You just predict, but you may not know exactly how well you did, at least in a mathematical sense — although of course, when we predict something, we bring certain hypotheses and domain knowledge that come with doing machine learning.
B
So now let's dig deeper into each one of them. Supervised learning is mainly split into two sub-categories: classification and regression. Classification is when you are trying to predict something that is a class or category, whereas regression is when you're trying to predict something that is a real number — you know, how tall I should have been if I had slept more when I was younger.
B
Again, it's super simple, and because of that, spammers have gotten really smart about adding specific words that would bias the algorithm, so people generally don't use naive Bayes spam filters anymore. Now, decision trees — again, this is pre-deep-learning — essentially you give them data and they partition the data into very interpretable branches, so that the leaves of the tree are what give the prediction.
B
So the decision tree has the great quality that it's very interpretable. Now, SVMs were pretty hot back when I was in grad school. Essentially you have your data, and the goal is to separate the classes with a hyperplane that maximally separates them; generally, you might transform your data into a higher dimension so that this is possible. Nonetheless, these three — I mean, there are others, but these are the classical algorithms from the classification sub-branch.
B
Now, as for regression: regression has its roots in statistics, and linear and polynomial regression are pretty much what you see here. So I've just covered the classical methods; now let's talk about deep learning. With deep learning, how do we do regression and classification? Well — sorry, I got tired and didn't do the cool animations anymore, so bear with me.
B
The idea is that you still have a network, and if you are doing classification where it's just true or false, then essentially you want to predict a probability: the number you're predicting is the probability that class one is true, and the other class is then one minus that probability. So that's why you only have one output neuron, and it's predicting within the range zero to one.
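That single output neuron can be sketched like this — the weights and inputs are toy values I made up, not anything from the slides:

```python
import math

def sigmoid(z):
    # squashes any real number into (0, 1), so the output reads as a probability
    return 1.0 / (1.0 + math.exp(-z))

def output_neuron(x, w, b):
    # one neuron: weighted sum of the inputs plus a bias, then the sigmoid
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return sigmoid(z)

p = output_neuron([0.5, -1.2], [2.0, 0.3], 0.1)
print(p, 1 - p)  # P(class 1) and P(class 2) = 1 - P(class 1)
```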
B
So if it's just true/false, once you get your prediction, you compare it against whether the label is actually true or false, using this cross-entropy loss to help tune your parameters. But say we have a very related problem, except now we have multiple labels. So previously it could be: am I female or male? And next it could be: am I over 40 years old?
B
So I could be true on multiple categories, and that's why, instead of just one number, we are now predicting several numbers; those numbers correspond to the number of classes in which I can participate, and that corresponds to the number of output neurons as well. You can see we could keep most things the same: if your input is practically the same, all these layers could pretty much stay fixed, and all you're doing is adjusting the output.
B
First of all, you need to make sure your target is now multi-class, and then you make adjustments to the output layer and adjust your loss if necessary — here we can still use the cross-entropy loss. But if, for example, we wanted to do regression instead, I am predicting just one number, and it's a real-valued number, so it's between negative infinity and infinity. Then you see how I changed my non-linearity from sigmoid to linear, because I don't want to squish the output.
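The swap from sigmoid to linear can be made concrete in a couple of lines — a minimal sketch, with an illustrative `head` function of my own naming:

```python
import math

def head(z, activation):
    # same network body, different output activation
    if activation == "sigmoid":   # classification: squash into (0, 1)
        return 1.0 / (1.0 + math.exp(-z))
    if activation == "linear":    # regression: leave the real value alone
        return z

print(head(5.0, "sigmoid"))  # close to 1 — squished into (0, 1)
print(head(5.0, "linear"))   # → 5.0 — free to range over (-inf, inf)
```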
B
You know, the starting point of the bounding box, x and y, and then how big the box is; and then there are ways to combine both losses together to actually solve this problem, whereby, given this picture, the network will be able to output the category as well as a bounding box. To my knowledge, I don't think classical methods can do that.
B
So that's a big win for those of us who have access to deep learning. Deep learning for supervised learning, again, is pretty straightforward. As for unsupervised learning — that's the second branch, where you don't have true labels, and so you don't really get any feedback as to how well you do — generally we do things like clustering, dimensionality reduction, or some kind of association. Let's dig in a little and quickly go over some of the classic methods. So here, this is k-means.
B
With k-means, essentially you have some data, and you also have to specify the number of clusters. You randomly initialize your cluster centers to some points, then you assign the neighboring points to the nearest cluster as shown, and based on that assignment you recompute the centroid of each cluster. Then you do it again and again until it partitions the data into these kinds of nice neighborhoods — so these are our clusters.
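The assign-then-recompute loop just described can be sketched in plain Python — toy data and a fixed seed, just to show the two alternating steps:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # pick k random points as the initial centroids
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # update step: recompute each centroid as its cluster's mean
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(sorted(kmeans(pts, 2)))  # one centroid lands in each blob
```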
B
DBSCAN is another one of these clustering algorithms. Imagine your data points are people, and you're telling them to hold hands if they're close to a neighbor — that's really what it does. It associates any neighbor that is close enough into the same cluster, and any point that doesn't have its hand held is a weirdo: it's flagged as noise. And then there's dimensionality reduction, which we generally do as well.
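The hold-hands idea can be sketched roughly as follows — a simplified version in which clusters are just connected components of points within a distance `eps` (real DBSCAN additionally requires a minimum number of neighbors for core points); the names and toy points are illustrative:

```python
def neighbor_clusters(points, eps=1.5):
    # union-find: link any two points within eps of each other ("hold hands")
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sum((a - b) ** 2 for a, b in zip(points[i], points[j])) <= eps ** 2:
                parent[find(i)] = find(j)
    # connected components with >1 member are clusters; singletons are noise
    comps = {}
    for i in range(n):
        comps.setdefault(find(i), []).append(i)
    clusters = [c for c in comps.values() if len(c) > 1]
    noise = [c[0] for c in comps.values() if len(c) == 1]
    return clusters, noise

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
clusters, noise = neighbor_clusters(pts)
print(len(clusters), noise)  # → 2 [5]  (the point at (50, 50) is the weirdo)
```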
B
Deep learning, I feel, is actually pretty exciting in terms of unsupervised learning, because people doing unsupervised learning within deep learning have been quite clever: they've been leveraging a lot of what's called self-supervised learning. So imagine that we want to compress data, and we have a picture like this. Remember I told you that sometimes we trick our neural network into doing something dumb so that we can get something cool on the inside? This is exactly one of those times. I'm telling it, hey —
B
This is my input: reconstruct this input. That's just weird, right? But by having the layers get progressively smaller, as shown — I've drawn it like a bowtie thing, not because it's pretty, but because it actually narrows the layers — we get down to this latent feature, which is what I want to force the network to compress the data into. And the decoder is essentially the inverse: the encoder, again, compresses the input into this latent feature.
B
So it's a lower-dimensional representation of my data. If this latent vector is called h, then h = f(x), where f is my encoder; the decoder g takes h, and I want to get back my original input. An autoencoder is all of these chained together, where I want the input x and the reconstruction r to be really close, and if this is trained properly, you get a really nice compressed feature that represents your input without all the dimensions of your input.
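The x → h → r chain can be sketched with a tiny linear encoder/decoder — the weight matrices here are toy values chosen by hand, not trained:

```python
def matvec(W, x):
    # plain matrix-vector product
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def autoencoder(x, W_enc, W_dec):
    h = matvec(W_enc, x)   # encoder f: latent feature, shorter than x
    r = matvec(W_dec, h)   # decoder g: reconstruction, back to x's size
    loss = sum((xi - ri) ** 2 for xi, ri in zip(x, r))  # want x ≈ r
    return h, r, loss

# 4-dim input squeezed through a 2-dim bottleneck
x = [1.0, 2.0, 3.0, 4.0]
W_enc = [[1, 0, 0, 0], [0, 0, 0, 1]]       # keeps dims 0 and 3
W_dec = [[1, 0], [0, 0], [0, 0], [0, 1]]   # puts them back
h, r, loss = autoencoder(x, W_enc, W_dec)
print(h, loss)  # h has only 2 numbers; dims 1 and 2 are lost, so loss > 0
```

Training would adjust W_enc and W_dec to drive that loss down.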
B
But the thing is, imagine apple and orange: alphabetically they're kind of far apart, but semantically they're both food, both fruits, and maybe they share that reddish-orange color. Ideally, if you think about it, you want them close in similarity when you represent them numerically, and that's what word2vec is about. It says: hey, given this representation — we call these one-hot vectors —
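A one-hot vector is just all zeros with a single 1 at the word's index — every word is equally far from every other, which is exactly the lack of semantics that word2vec's dense embeddings fix. A minimal sketch, with an illustrative toy vocabulary:

```python
def one_hot(word, vocab):
    # all zeros except a 1 at the word's position in the vocabulary
    v = [0] * len(vocab)
    v[vocab.index(word)] = 1
    return v

vocab = ["apple", "orange", "car"]
print(one_hot("orange", vocab))  # → [0, 1, 0]
```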
B
So you might think: well, that's interesting, but how do you do that? It's actually very similar to the autoencoder. Instead of reconstructing the actual words, what we do instead is predict the neighboring words. It's kind of like: given a word, I'm going to predict my neighbors. Let me make this more concrete for you.
B
For example, if you have these sentences here, like "I like playing...", what you do is take a center word. There are two flavors of this algorithm. You have your center word, and the window of words before and after it are what you call the context words. You see how, as you slide your center word along, you get different context words; your window shifts.
B
As the window shifts, when "playing" is now the center word, "I" and "like" and the words after it become the context words, and so on and so forth. So again, even though with deep learning I said you don't need to do feature engineering, with text you need to at least do some of this one-hot transformation before you can do things with it.
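Generating the (center, context) training pairs from a sliding window can be sketched like this — the sentence and window size are illustrative (this is the skip-gram direction; CBOW just flips which side predicts which):

```python
def skipgram_pairs(tokens, window=2):
    # pair each center word with every context word up to `window`
    # positions before or after it
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sent = "I like playing chess".split()
for c, ctx in skipgram_pairs(sent):
    print(c, "->", ctx)
```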
B
So say you have encoded your text in this kind of data format. Then, depending on whether, given the center word, you want to predict the context words, or, given the context words, you want to predict the center word, you get the two flavors of models that allow you to learn these dense vectors, which we call word embeddings. So again, compared with the autoencoder —
B
Where the autoencoder goes through a lot of layers, word2vec is actually a very shallow model: it's just the one-hot vectors and one hidden layer, and that's essentially it. The word vectors are the ones we're interested in — they're right here — so it's a really shallow network.
B
The way that we're now able to represent words in a semantically relevant way has really paved the way for a lot of the more advanced natural language processing tasks, like captioning and all that. So, another really cool idea in unsupervised learning: GANs. GANs are interesting in that they were originally proposed as an unsupervised method, but they've now been used for semi-supervised learning and more. Nonetheless, you don't really need to have the labels, because all you really need is the input.
B
Essentially, it's a generative model. Previously we talked about classification — is it class 1 or class 2 — and all those models are more like discriminative models, where from the inputs you're just modeling what the output would be. Here, though, we're trying to model, given the inputs, the probability distribution around those inputs, such that we can perhaps use that probability distribution to generate more inputs. The way it works is that there are actually two parts to it: the generator and the discriminator.
B
So it's this two-player game that it's got going. The discriminator, again, is the neural network that takes a picture, not knowing whether it came from the generator or from the true training set, and it tries to maximize the probability of being able to distinguish between the two. The generator, meanwhile, is trying to learn the distribution of the true training data set so that it can essentially fool the discriminator into thinking its images are real.
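One round of the two-player objective can be sketched as follows — purely illustrative: the toy generator and discriminator here are fixed hand-written functions, and a real GAN would backpropagate through both losses rather than just compute them:

```python
import math
import random

def gan_round(real_data, generator, discriminator):
    # D outputs its belief in (0, 1) that a sample is real;
    # G maps a noise value z to a fake sample
    fakes = [generator(random.random()) for _ in real_data]
    # D's loss: it wants D(real) high and D(fake) low
    d_loss = -sum(math.log(discriminator(x)) + math.log(1 - discriminator(g))
                  for x, g in zip(real_data, fakes)) / len(real_data)
    # G's loss: it is happy when D is fooled into scoring fakes high
    g_loss = -sum(math.log(discriminator(g)) for g in fakes) / len(fakes)
    return d_loss, g_loss

real = [0.9, 0.8, 0.7]
gen = lambda z: z * 0.4                   # toy G: outputs stay below 0.4
disc = lambda x: 0.9 if x > 0.5 else 0.1  # toy D: a simple threshold
d_loss, g_loss = gan_round(real, gen, disc)
print(d_loss < g_loss)  # → True: this D separates real from fake easily
```

Training alternates: lower d_loss by improving D, then lower g_loss by improving G.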
B
So it's this two-player game, and even though it's a really cool idea, it's sometimes hard to train. Nonetheless, if you are brave enough to check it out, people have used GANs for data augmentation. What I'm showing you here is from a paper on generating fluid simulations with GANs, where they gave it a whole bunch of 2D and 3D fluid simulations as well as some simulation inputs. This is one of those plots that I showed you, where —
B
This axis is the epoch and this is the loss, and at each step I'm also showing you what the generated fake picture looks like. You see how in the beginning it's not doing so well — maybe it's kind of underfitting — but as it continues, it looks like real fluid simulations. So I've covered autoencoders, word2vec, and GANs.
B
It would be nice to just tell my car to take me to LA and have it do it — we'll see. But yeah, reinforcement learning has been pretty big, because Uber and all the autonomous car companies, as well as a lot of the automation out there in these industries, are putting investment into reinforcement learning. But okay, what is reinforcement learning? As I mentioned: imagine you're an agent.
B
You have a brain, and you get some observation from the environment. From this observation you internalize and think: okay, the state of the world is probably this. Then you take an action, and that action changes the state of the world, which in turn generates a reward, which you receive, along with another observation, and that seeds your next action. This process iterates, and again, it's very intuitive.
B
It's like any one of you who plays video games, or even tries to bake muffins — anything. Essentially, as you're doing something new, you try different things, and then, based on the outcome of whether you succeeded or not, you modify your actions according to what you observe as having been done right or wrong, as well as the rewards.
B
So, as mentioned, the reward is a time-delayed feedback, and the agent's job is really to take actions that maximize the cumulative reward — meaning, if this is a game, you want to maximize not just your next-time-step reward, but your rewards across all the time that you're playing this game.
B
So let's formalize this a little bit more. A Markov decision process is a nice way of describing this kind of decision process. With an MDP, we have an observation space, an action space, and a way to encode how states and actions make the world transition to the next state.
B
So that's the state transition function. There are also your rewards, because that's the signal you will get, and some kind of discount factor, because if you were playing over an infinite horizon, you'd want to discount the rewards so that they're temporally relevant. An MDP also satisfies the Markov property, such that the state at the next time step only depends on your current state and your current action. So with this —
B
Essentially, when you formulate an RL problem, you pretty much have to know your observation space, your action space, all these formulations, and some extra concepts. In general, the agent is trying to maximize the reward, and this reward, as shown, is a weighted sum — a discounted sum of the rewards it would get at each round. This is what the agent wants to maximize.
B
A policy describes how an agent should behave: it's a mapping from state to action. And the value function is essentially a way of saying, hey, if I'm in this specific state, what is my expected reward — how good am I doing, am I sitting pretty in this state? These are the V and Q values that represent how good it is to be in a certain state.
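The discounted sum the agent maximizes can be written out directly — gamma and the reward lists here are toy values:

```python
def discounted_return(rewards, gamma=0.9):
    # sum of gamma**t * r_t: rewards far in the future count less than rewards now
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([1, 1, 1]))   # 1 + 0.9 + 0.81 = 2.71
print(discounted_return([0, 0, 10]))  # 8.1 — a delayed reward is discounted
```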
B
Now, how do you go about learning an optimal policy? Ultimately we want to maximize this, and we know that our reward comes from taking good actions and being in the right states. Therefore we need to learn a good mapping from state to action — a good policy. So imagine if we knew how good it is to be in a specific state — or rather, if for every action we knew how good the expected reward would be.
B
Then we would be able to just do an argmax. But this Q function is tricky: how do we know it exactly? We kind of have to play the game and observe what rewards we get in order to find out. So what we can do is frame it as another one of these regression problems, where at any given time you're trying to model what your Q is: you have a model that is trying to predict Q.
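That regression view of Q can be sketched with a table standing in for the network — a toy tabular Q-learning step, with made-up states and actions; DQN replaces the table with a neural network doing the same job:

```python
def greedy_action(q, state, actions):
    # once we know Q(s, a), the policy is just an argmax over actions
    return max(actions, key=lambda a: q[(state, a)])

def q_update(q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    # nudge Q(s, a) toward the observed reward plus the discounted
    # best Q of the next state (the regression target)
    target = r + gamma * max(q[(s_next, b)] for b in actions)
    q[(s, a)] += alpha * (target - q[(s, a)])

q = {("s0", "left"): 0.0, ("s0", "right"): 0.0,
     ("s1", "left"): 0.0, ("s1", "right"): 0.0}
q_update(q, "s0", "right", 1.0, "s1", ["left", "right"])
print(greedy_action(q, "s0", ["left", "right"]))  # → right
```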
B
Essentially, they take several frames and put them through one of these neural networks with convolutional layers — which will be covered later on — and out come the Q values, one per action, as shown. As a result, anywhere reinforcement learning has a place where it's approximating a function, deep learning has found its way in to fulfill that need, and to do it really well.
B
That's pretty much what I have for reinforcement learning, but I also want to call to your attention that there are other types of learning, such as transfer learning, semi-supervised learning, and active learning. Transfer learning is: once you have a model that you've trained and you now have a related problem, you don't want to throw away all your hard work of tuning the parameters — is there a way you can transfer that knowledge over? That's transfer learning. And semi-supervised learning is kind of a mishmash between supervised and unsupervised.
B
Imagine you can only afford a really small training set that has labels, but you have a whole bunch of other data that is not labeled — that's the unsupervised part. Is there a way to perhaps learn structure from the unlabeled data and then use that to provide good initial values for the supervised problem? So there are different ways of mixing supervised and unsupervised. And lastly, active learning is when, again, you don't have much training data — you only have a few labeled instances.
B
Is there a way to have a model that not only makes the prediction, but also tells you how uncertain it is, so that you can compute some kind of entropy? You would then use that information to know where your model is uncertain at a given time, so that you can selectively say: okay, now I want to choose the point where I know my model does not do well, and get a label for it.
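That entropy-based selection can be sketched in a few lines — the predicted class distributions below are illustrative stand-ins for a real model's outputs:

```python
import math

def entropy(probs):
    # Shannon entropy of a class distribution: higher means more uncertain
    return -sum(p * math.log(p) for p in probs if p > 0)

def most_uncertain(pool_predictions):
    # pick the unlabeled point with the highest-entropy prediction —
    # its label buys the model the most information
    return max(range(len(pool_predictions)),
               key=lambda i: entropy(pool_predictions[i]))

preds = [[0.95, 0.05], [0.5, 0.5], [0.8, 0.2]]
print(most_uncertain(preds))  # → 1 (the 50/50 point is the one to label)
```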
B
Deep learning is really a team player with all three of them. With supervised learning, we've already seen it: it's super easy to go between regression and classification — you can even do them together; remember the cat and the bounding box, which is super cool. With unsupervised learning, we saw that people use clever ways of creating a fake signal: remember, for the autoencoder the target is the input itself, and for word2vec —
B
— the signal was the nearby words. So there's really no true labeling involved, but it uses an artificial yet meaningful question within the structure itself to do supervised learning of that structure. And reinforcement learning — well, I've only talked about DQN, which uses deep learning to approximate the Q function, but there are other works where reinforcement —
B
Sorry — deep learning can also be used to learn the policy mapping directly, as well as to approximate the models that go into the Markov decision process. And once you have those models, if you have a good approximation, it boils down to a planning problem, which is easier to solve — well, in most cases. But that's pretty much it, and if you have questions, feel free to speak up.