From YouTube: DevoWorm ML: Week 12 (Reinforcement Learning)
Description
Twelfth DevoWormML meeting, November 20. Attendees: Richard Gordon, Ujjwal Singh, Vinay Varma, Bradly Alicea and Jesse Parent
So we had a request that we move the meeting time to Mondays. At the same time, we moved it up one hour for Vinay, but [inaudible] schedule doesn't allow him to meet an hour earlier. So now I'm asking if people would be okay if we moved the meeting to Mondays at the same time, so it would be just a change in day.
Yes, of course. [inaudible] they'll choose an organization, and then it's much the same process. First they'll choose an organization they're interested in, then they'll contact members of the organization, the administrators, and then they'll have to get tasks assigned to themselves.
Well, that sounds pretty good, and I think that's a good opportunity for you. You know, it's always good to learn how to teach people things. That's always a good skill, because you can impart knowledge to people who otherwise wouldn't have the opportunity to get exposure. I mean, you can read things on the internet, but it's always better to have someone help you along with things and give you advice, so I think that's good. Keep us posted on that; that sounds interesting.
So that's good, thanks for the updates. If there's nothing else that anyone wanted to talk about, we can go into the presentation I've prepared here on reinforcement learning. I was thinking about last week: I did a presentation on game theory, and I think Jesse was here for it, and I recorded it and it's on the YouTube channel. It kind of leads into this, because we talked about games and competition and how that's being used in machine learning.
It's actually a very interesting area, especially when you get down to how they're using game theory, basically, in lieu of an optimization function. A lot of algorithms use loss functions, and so they're using game theory as a sort of stand-in for that. The reason they do that is because you have a lot of non-convex spaces, meaning they're not smooth curves that you can hill-climb or optimize easily. So you have a lot of problems with what they call a non-convex space, which is very irregular and not easy to optimize, and so they play games between agents and look for what they call Nash equilibria. That's actually very hard to do in mathematical terms, so yeah, it's sort of like a rough landscape.
It's analogous to that, so they're using Nash equilibria to find these points that are, you know, solutions that will correspond with minimum loss. But the thing is, it's very hard to even do that mathematical analysis, so we're talking about something that isn't really mathematically tractable a lot of the time. So people are applying these models to get a good [inaudible].
So this is reinforcement learning, and along the way we're going to kind of drift into biology and psychobiology, so if you see references to that, it's related. This is what I mean: reinforcement learning is basically learning a process with a reward, and we most often associate this type of thing with animal behavior, but human behavior follows this as well. It's been demonstrated in animals like dogs and in rodents,
where you have two kinds of conditioning. You have classical conditioning, where you associate an involuntary response with a stimulus: there's an idea of pairing these two stimuli and then taking one away, and you're motivated by the stimulus that's been paired. So, for example, in the case of Pavlov's dog, you blow a whistle and present food to the dog, and the dog sees the food and hears the whistle and associates the two. So the dog starts to salivate at the food, but also at the whistle.
Then, after you train the dog on that paired stimulus for a while, you take the food away and just blow the whistle, and in that case the dog will still salivate, expecting the food reward as well. So you can assert that the organism associates the two stimuli, and even when you take one away, it still maintains that association. Operant conditioning, which is related, is associating some voluntary behavior with a consequence. So in this case you have a rat that pushes a lever, and this machine here dispenses food.
The rat will then learn that if it presses the lever, food comes out. Now, if food doesn't come out and they press the lever, they'll still associate it with food: the rat might go up to a machine that's empty and press the lever expecting a food reward, and even if food doesn't come out, they'll still press the lever, because they're associating those two things. This happens over time, so this is a time-dependent process. This is a diagram of what they call trials; this is when they do this training.
It's an experimental context where they present these things in successive trials. So you present these two stimuli in one trial, two trials, three trials, in this case up to 20 trials. In the first trial we have a bell, and then we have food and salivation, and by the 20th trial salivation has gone from an unconditioned response to a conditioned response. So you're actually conditioning salivation on the ringing of this bell rather than the presentation of food, and so this has a long history in psychology.
This goes back to the 1890s with Pavlov, and he did these experiments with dogs where he associated the ringing of a bell with the presentation of food and was able to condition the dog's brain on these items. Then there was Thorndike, who came up with the law of effect, and you can read more on this; this is just to give you an idea of the history of this in psychology, and there are some famous experiments in here. The next person to really do some groundbreaking work in this area was Skinner, and you've probably heard of B.F. Skinner.
He was the one who demonstrated operant conditioning, so this is pressing the lever and getting food. Then there's Albert Bandura, who was more recent, who came up with something called social learning theory, which is based on reinforcement learning but also in a social context. So this has a long history in psychology, and this person, Richard Sutton, is actually known for establishing this in artificial intelligence and machine learning.
Now, Richard Sutton actually has a background in psychology, so he came to computer science and wrote a dissertation called Temporal Credit Assignment in Reinforcement Learning, back in 1984. Richard Sutton and Andrew Barto, his doctoral adviser, have written this textbook called Reinforcement Learning, and this is sort of the landmark book of the field. It's actually available online for free through this link here; it's like a continually revised rough draft of the book, but it's free and you can check it out online. They've been writing this book for a while;
I think the first edition was in 1998, and the edition that you can download is from this year, so they're updating it constantly with new examples. Wikipedia defines this type of reinforcement learning as how computational agents take actions in an environment so as to maximize some notion of reward. So this is a little bit different from some of the things that people do in behavioral psychology. You have to set up a reward system, and you have to set up
how you're going to present that reward over time and how it optimizes learning by the algorithm. So this is the basic setup of a reinforcement learning algorithm: you have an agent up here, and it takes an action, a sub t, and it interacts with its environment. The agent is embedded in an environment, it takes an action in it, and then it displays a behavioral state, s sub t, and then there's also a reward, r sub t.
So every action you take results in some reward, and that reward-action coupling is a state. To change the state, you would take maybe a different action, but to take a different action you need a reward structure that enables you to do that. So the reward is sort of like feedback for previous interactions: your state is your current state, and your next action, then, is dependent on the reward that you got for the previous action.
So if you're doing X, and you have a reward structure that you kind of learn through doing these different actions, your state might move from X to Y, because your actions change based on the reward. So this is definitely a sort of feedback system: you have a current state, you have a current action, you get rewarded, and then maybe you move to a new state based on that reward structure.
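As a minimal sketch of that state-action-reward loop in code (the environment and policy here are hypothetical stand-ins, not from the talk or any particular library):

```python
import random

# Hypothetical two-state environment: taking the "matching" action in each
# state yields a positive reward. Purely illustrative, not a real benchmark.
def step(state, action):
    reward = 1.0 if action == state % 2 else -1.0
    next_state = (state + 1) % 2
    return next_state, reward

def random_policy(state):
    return random.choice([0, 1])      # two possible actions

state = 0
total_reward = 0.0
for t in range(10):                   # one short episode
    action = random_policy(state)         # a_t
    state, reward = step(state, action)   # s_{t+1}, r_t
    total_reward += reward
print("return:", total_reward)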
This means that you have something called a policy, which maps between the agent's states and its actions. So, for example, you design a policy to say: okay, there's this desirable state that we want. The agents all start in some initial state, which isn't really biased towards anything; we want to give them a range of actions that they can take, but also rewards that will guide them to the right, or desired, state. And so the value, then, is the future reward for potential actions.
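In the standard Sutton-and-Barto notation, which the talk is paraphrasing, the policy and the value of a state are usually written as:

```latex
\pi(a \mid s) = \Pr(A_t = a \mid S_t = s),
\qquad
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s\right]
```

That is, the policy gives the probability of taking action a in state s, and the value is the expected discounted future reward starting from s under that policy, with discount factor gamma.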
There's supervised learning, where you present the algorithm with identifiers that might be useful in classifying things. Then there's unsupervised learning, which is where you present the algorithm with data blindly, with no identifiers or anything, and expect the algorithm to sort it out and put it into categories. So that would be like cluster analysis, where you're just saying: produce some clusters, and we'll see if they're accurate or not. Whereas in the supervised case, you might create categories and say, put these in the right category, and then check later to see
if that happens. Reinforcement learning is sort of its own beast, and it involves learning from mistakes: to get a correct classification out of reinforcement learning, you have to train the algorithm on a bunch of mistakes, and then it learns from those mistakes and arrives at the right classification in time. I took this image from an article, Reinforcement Learning 101, from Towards Data Science, and they have more information about it in that blog post. I'll make these slides available online, as always, so
you can get the links. There are different varieties of reinforcement learning as well. We have classical reinforcement learning, where we just have the reward-state-action structure and there's maybe interaction between you and the computer. Then there's deep reinforcement learning, where the agent, instead of being, say, an avatar driven by a person, is a deep neural network or something like that.
So in deep RL they're using a deep neural network as the agent, and they're training the agent on something. This deep neural network is embedded in an environment, and there's of course this reward, and in this case observations, where the world is observed; there's an action, and then there's a reward. And so you can use deep learning.
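To make "the agent is a deep neural network" concrete, here is a minimal value network sketched in PyTorch; the layer sizes and architecture are illustrative assumptions, not the network from any paper discussed here:

```python
import torch
import torch.nn as nn

# Minimal sketch: a small network mapping an observation vector to one
# estimated action value per possible action (generic, illustrative only).
class QNetwork(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

q = QNetwork(obs_dim=8, n_actions=4)
obs = torch.randn(1, 8)               # one observation from the environment
action = q(obs).argmax(dim=1).item()  # greedy action under current estimates
```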
You can also use something called model-free RL, which involves Q-learning, which is an algorithm that I'm not going to get into today. But this is a way of using different update strategies for the optimization process. Q-learning is basically that same structure of iterative learning, but you have a certain set of weights that you're using to weight evidence that occurs at a certain time, or further away in time. It doesn't imply any sort of a priori model; it's something that you can do sort of dynamically, applying the Q-learning algorithm to the data that way.
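For reference, the standard tabular Q-learning update (with the usual learning rate alpha and discount gamma; the values below are illustrative choices) looks like this:

```python
from collections import defaultdict

# Tabular Q-learning: nudge Q(s, a) toward the observed reward plus the
# discounted best estimate for the next state.
alpha, gamma = 0.1, 0.9
Q = defaultdict(float)   # maps (state, action) -> estimated value

def q_update(s, a, r, s_next, actions):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Example: one transition where action 1 in state 0 earned reward 1.0
q_update(s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
```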
So one example of reinforcement learning, which is exciting, is where they're using it to look at video games like this. There are a couple of examples I put in this talk. The first one is this paper, Playing Atari with Deep Reinforcement Learning. This is an example where they've taken some of the old Atari games, and they use these games, of course, because their complexity is fairly low in terms of the graphics and the number of moves you can make, but they're still challenging for the algorithm.
So in this case, you have this screenshot here; the name of this game is Seaquest, where you're moving a submarine around and you're trying to avoid being shot at by other submarines, and there's a reward structure in this game to maximize your point total. This algorithm has been applied to this game. Now, normally humans play this game: they're evading obstacles and making their targets and trying to maximize their score.
Well, the algorithm does the same thing. In this paper they actually tested the algorithm on Seaquest and another game called Breakout, and I think I have a screenshot of Breakout later in the talk. Basically, Breakout is also a game of this complexity, where you're bouncing a ball off a paddle and you're trying to remove bricks from a ceiling that opens up, and when you win the game you break out of this cave or whatever it is you're in. So it's pretty simple in
the number of moves and the number of states that you can be in, and that's why they use these games. And they show you how this works: we have training epochs here, so they've trained it over a hundred epochs, and then they measure the average reward per episode. These epochs, I assume, are game plays, so they've
been training it over and over again, using the same algorithm but presenting it with the game over and over, and the algorithm keeps playing the game. You can see that the reward, probably measured through the score or something like that, goes up over time: between zero and fifty epochs it really starts to maximize its reward, and then after about 50 epochs it starts to plateau, and the reward, for Breakout at least, hits an asymptote there. It's kind of a logarithmic learning curve. On Seaquest,
you see it goes up to maybe 50, 60, or 70 epochs, where it's learning and maximizing its reward, and then you start to see some instability in the algorithm: if you keep training it longer and longer, it starts exploring new strategies, perhaps, and it's a little bit more uneven in its performance. But you can also look at this in terms of Q-learning.
So this is the Q score, which is the value in this Q-learning algorithm, and in this case the results are a little bit stronger: you have the same logarithmic pattern of learning, but it doesn't show this sort of variation in terms of the reward structure, so the Q value actually is maximized all the way up to a hundred epochs. But it's logarithmic,
so you spend probably epochs zero to fifty really training the algorithm, and after 50 it's still increasing, but only nominally, as it's really kind of learned how to play this game. So this was a big advance when it was announced in 2013.
This is using deep reinforcement learning, so they're using deep learning to do this sort of thing, and this is totally autonomous: the algorithm is getting no input from the human, it's just playing the game on its own. So this isn't unprecedented; they've been trying to get artificial intelligence to play games for a long time. In 1997, IBM's Deep Blue beat Garry Kasparov,
who was the world champion at chess at the time, and he got so angry that he decided he was going to create his own type of chess, a hybrid chess where he used human experts and algorithms together to develop optimal chess styles. I don't want to talk too much about that in this talk, but suffice it to say that this is a classic sort of problem that people have used to try to benchmark algorithms. And so, in this case, they used AlphaGo to play the game Go against
the player who was the world champion at the time. Go is like chess, but it's sort of a variant of chess; they play it a lot in China, and it's sort of the computational analogue of chess, but it's a little bit different. In this case, this group published a paper in Nature where they were able to use a combination of deep neural networks and tree search to beat the world champion at Go, and so this was another landmark. That just shows you what these algorithms are capable of.
And finally, we have this new paper, Human-Level Performance in 3D Multiplayer Games with Population-Based Reinforcement Learning. So this is reinforcement learning using a number of different strategies: using a population of agents and presenting the algorithm with information from a game. I think it's some sort of first-person game; it's much more complex than the Atari games.
If someone wanted to read it and comment on it, that would be great, but anyway, this is yet another type of approach to reinforcement learning, and this was published in Science recently.
So there are different ways that this is done. Let's look at how reinforcement learning is implemented in machine learning versus classical conditioning, and then even synaptic plasticity, which is in the brain itself. Classical conditioning happens in sort of a behavioral state, but it also involves a number of brain regions.
So it involves things like Hebbian learning, and it involves neuronal reward systems like the basal ganglia, which is a part of the brain that responds to rewards. This is a very complicated diagram; we took it from the Scholarpedia site, but it sort of shows the different things that are involved in, say, the machine learning implementation of reinforcement learning versus the classical conditioning form of it, which is a combination of, you know, updating behavioral states
(though of course the brain is doing a lot of work in that), and then synaptic plasticity, which is really kind of separate from the machine instance, because it involves a lot of signaling between different neurochemicals, producing states like long-term potentiation and short-term potentiation and things like that. So that's how it's all related in terms of the biology versus the machine learning. But of course we expect
reinforcement learning to behave like a human brain and make decisions like a human, or even like some sort of animal; that's the whole logic behind it. And so, on to related issues and topics. Now I want to move on to a little bit harder detail on these models. The first thing I talked about before was the policy gradient, and that's sort of the core of this type of algorithm, so let's walk through it a little bit.
I took this from a Medium post here, so you can look that up in more detail if you want. But let's walk through it. Consider what they call a policy, which is denoted by the symbol pi and described by an action given an initial state. So we have our notation here: an action, of course, is just something that the algorithm does, and it's given an initial state that it starts out at. The objective, then, is to find policy parameters theta that create a trajectory which yields maximal expected rewards.
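Written out in the standard policy-gradient notation (this formulation is standard, not quoted from the slide), the objective and its REINFORCE-style gradient are:

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[\sum_{t=0}^{T} r_t\right],
\qquad
\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, G_t\right]
```

Here tau is a trajectory sampled from the policy pi_theta, T is the time horizon, and G_t is the return following time t; adjusting theta along this gradient makes high-return trajectories more probable.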
You could make it random, but generally there are behaviors that are more likely and less likely, behaviors that are rewarded or not rewarded; there are more accessible states and less accessible states, and then, depending on the rewards, you might end up finding the more successful states. I'm trying to think of a good example from human behavior.
But let's say you had a program where there is some behavior that's really hard for people to attain, like stopping smoking: it isn't really easy for someone who's smoking to just stop smoking.
So you give the person a reward if they don't smoke for a week, and you use these sorts of rewards, positive and negative, to kind of shape the behavior towards a more desirable state, one that maybe is less accessible at the beginning, but that's what you would use to attain that behavior. So it's a combination of the trajectory of behaviors towards the desired state and the reward structure. The trajectories extend to some time horizon, so we don't want to project it infinitely out into time. We generally think:
okay, we want to achieve this overall change, this behavioral optimization, within a certain time horizon. You saw with one of the game examples that they measured it for a hundred epochs, so that's a finite time horizon, and, as you saw, it wasn't all really needed: about 50 to 70 epochs really maximized the reward in training on this trajectory. The state itself can consist of either specific or generalized features,
maybe cells, for example. Now, the machine has no understanding of what cells are initially, but you would train the machine: okay, if it picks something that's ovoid or circular, then that's a reward for the algorithm; if it picks something that's lobed, then, depending on the cell type, that might be rewarded or given a negative reward as well. And so there are ways you can optimize this so that the reward structure sort of
A
You
know,
and
it
rolls
down
to
the
features
that
you
want
and
their
properties
and
the
policy
objective
then
can
be
either
stochastic
or
a
series
of
directed
actions.
Like
I
said
you
can
start
off
with
a
stochastic
action,
but
just
picking
an
action
out
of
hat
or
you
can
direct
the
algorithm
to
certain
actions,
a
certain
subset
of
actions
and
train
it.
That
way
really
depends
on
how
the
algorithm
is
implemented,
and
that
also
determines
the
effectiveness
of
the
policy.
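As a toy illustration of that kind of feature-based reward structure (everything here is hypothetical: the feature names, values, and reward magnitudes are invented for this sketch, not taken from any DevoWorm pipeline):

```python
# Hypothetical reward shaping for a cell-identification agent: reward picks
# of round(ish) objects, penalize picks that don't match the target cell
# type. Feature names and values are made up for illustration.
def reward(picked):
    if picked["shape"] in ("ovoid", "circular"):
        return 1.0
    if picked["shape"] == "lobed" and picked["cell_type"] == "target":
        return 0.5        # partially rewarded, depending on the cell type
    return -1.0           # negative reward: steer the policy away

print(reward({"shape": "ovoid", "cell_type": "target"}))   # 1.0
print(reward({"shape": "lobed", "cell_type": "other"}))    # -1.0
```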
So, given a sequence of behavioral states and rewards: this is the signal, this is the state actually, and then this is the reward, and you keep doing this iteratively. You have a state, you have a reward; the state might change as you get rewarded positively or negatively, and you end up in a final state where you have a bunch of rewards that have shaped you to that point.
So the sequence of behavioral states and rewards produces an action policy, and this is termed the state value function, which is the expected return of a strategy. So your strategy should have some sort of return to it, and of course the return should be optimal, but not all policies are optimal. Some policies are pretty bad, because you might actually get the policy from, say, observing the data set and figuring out what certain rules might be.
You might see an unsupervised or unlabeled data set and try to extract some statistical features from it, and use those statistical features to shape the reward structure of your policy. But you might have a number of competing policies where you don't really know what the actual payoff is: you think they're all optimal, but maybe only some are. So you want to try different policies and you want to evaluate them, and temporal difference learning allows you to do this,
where you have this iterative structure: you have the state and you have the reward, and this structure is time-dependent, and then you have this parameter, which is a discount factor. The discounts are negative weights, so you can negatively weight things that are further into the future. That way you can have some sort of open-loop control where you reward things early on, and then the reward becomes less salient as you move on, as you approach the desired state.
This is to sort of avoid overfitting, where you don't want to over-reward the algorithm for seeking out new states. Or maybe you have a target state that you want to achieve, but you reward it too much, and so it starts to explore different states again. I mean, there are different ways this can play out. You can do this
with open-loop control, which is using this sort of weighting scheme, and then you can also have a sort of what they call the neuronal version, which is how it sort of happens in the brain. You can actually apply this to machines as well, where you have a neuron called V that can predict a reward R, and this updates adaptively until the algorithm, or the behavior, converges. So this is a closed-loop system.
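A minimal sketch of that converging value prediction, written as a TD(0) update in Python (the learning rate alpha, discount gamma, and the transition are illustrative values):

```python
# TD(0): the value estimate V(s) is nudged toward the observed reward plus
# the discounted estimate for the next state, until predictions converge.
alpha, gamma = 0.1, 0.9
V = {0: 0.0, 1: 0.0}            # value estimates for two states

def td_update(s, r, s_next):
    td_error = r + gamma * V[s_next] - V[s]   # prediction-error signal
    V[s] += alpha * td_error

# Repeatedly observing the same transition drives V[0] toward r + gamma*V[1]
for _ in range(100):
    td_update(s=0, r=1.0, s_next=1)
print(V[0])   # approaches 1.0 here, since V[1] stays 0
```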
So this is kind of related to what I was talking about with overfitting of your model, or underfitting for that matter. You have this idea that the algorithm can exploit different areas of the state space: if you want to change behavior, you have to have a state space with the different behavioral states that are possible, but you don't want to explore every state.
You want to be able to find an optimal state, but you also want to explore enough states so that you find the optimal state that you desire. On the left is an example of this trade-off in terms of the amount of information versus the return, and in this case it's figuring out whether the sky is blue or not by asking people. So let's suppose you're blind or you have a blindfold on and you don't know the sky is blue, so you start asking people whether the sky is blue.
If you ask one person, which is a very small amount of information, whether the sky is blue and they say yes, that's a pretty high return on your investment, but of course you don't know if they're lying to you, or if they can't see either. So you might ask ten people whether the sky is blue.
Now, there's less information per answer in asking ten people, and the return is also a little bit less as well, because when you ask ten people you'll get a lot of redundancy; but, on the other hand, you'll get an average, and you can tell that way. If 50 people tell you the sky is blue, that reduces your amount of information per answer, and it also reduces your return on investment, since you're asking more people. And if 2,000 people told you the sky is blue,
your amount of information is very low relative to your return, and that's assuming that the number of people lying to you, or who don't truly know, is very low. So that's the idea behind this: you don't need to explore everything, but you do need to have some information, and the amount of new information decreases as you explore.
So another way to think of this is the one-armed, or the n-armed, bandit problem.
You're playing a slot machine, and you know your odds of winning are very low when pulling the lever, but you keep playing because you might win; it's the classic gambling problem. So the idea is that you're playing this game against nature and you're trying to get some sort of payoff from that.
But the idea is that you play one slot machine, and that's exploration: you're exploring that state space to see if you can get to a certain state, and since you're doing it randomly, it will take a long time to get there. But if you add a bunch of machines in parallel, which is the n-armed bandit, you can keep exploring by pulling a bunch of levers at the same time, and I think if you've ever visited a casino you'll see this.
In this case, it would allow you to explore a vast space in a very short amount of time, but the trade-off still exists: you can use as many bandits as you want, but just because you're playing an infinite number of bandits doesn't mean you have a better chance of getting to that point. So the n-armed bandit problem is something you should be aware of.
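A common minimal treatment of that explore-exploit trade-off is an epsilon-greedy bandit, sketched below (the arm payout probabilities and the value of epsilon are made-up values for illustration):

```python
import random

# Epsilon-greedy n-armed bandit: with probability epsilon explore a random
# arm, otherwise exploit the arm with the best estimated payoff so far.
payout_prob = [0.2, 0.5, 0.8]        # true (hidden) win rate for each arm
estimates = [0.0] * 3
counts = [0] * 3
epsilon = 0.1

for t in range(1000):
    if random.random() < epsilon:                       # explore
        arm = random.randrange(3)
    else:                                               # exploit
        arm = max(range(3), key=lambda a: estimates[a])
    reward = 1.0 if random.random() < payout_prob[arm] else 0.0
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean

print("estimated win rates:", [round(e, 2) for e in estimates])
```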
And then finally, you have multi-level optimization, where you can break the problem up into different levels, with different agents in those levels exploiting and exploring things differentially. So you can break the problem up into modules and explore the problem that way using your reinforcement learning algorithm, and each of these agents would be a single agent in an area. In a reinforcement learning context, each of these agents would have the ability to learn from rewards and then eventually get to the answer. And so those are all the slides I have.
For those of you who are interested in genetic algorithms, there is a link, of course, between this and genetic algorithms; I highlighted it in the last slide. There are definitely a lot of commonalities, both in terms of using a biological process to look at data and at problems related to search, and in some of the concepts that are used. So let's see what our chat window looks like here.
Well, I don't know if they've measured that too much. I know from some of the literature on training in the psychological literature that there are curves for human learning in games and in expertise. You see that same pattern, where you have this initial burst of learning, where you're learning the parameters of the game, and then there's a plateau where you're kind of learning the particulars of the game, but you've already kind of mastered the basic aspects of it.
And then Richard also asked a question about an application: the use of X-ray photons in computed tomography may cause cancer, so how can we keep the number of photons used to a minimum? I think that would be a good application. I don't know if people have done that; it sounds like they may have. People like to explore these types of problems.
[crosstalk] The question is: what is the minimum number of photons you need to achieve a sufficient image to do a diagnosis? There has been some progress there, but no one's looking at that question seriously. No, yeah, especially because most of the current algorithms are [inaudible] based, so you have to change the community.
Well, let me make just one point about that: computed tomography now produces the lion's share, by far the majority, of the dose that people get from medical X-rays. So yes, that's actually a serious problem, and there are many papers, at least, contending that multiple CT scans can cause more cancers than the first.
Okay, Vinay, no questions? I think this talk gave a very nice overview of reinforcement learning. Okay, yeah, I think it's definitely an interesting technique. I wanted to cover it because we've talked about machine learning a lot, but there are different sorts of versions of it, and it's actually become pretty popular, especially for applying it to games, but there are probably a lot of other techniques that can be
applied to deep reinforcement learning. You'll see it's very hot in the literature right now, with groups like OpenAI and DeepMind. Well, I just put that together; I was doing some reading on it, so I've actually become pretty adept at putting together talks and slide shows like this. So, okay, well, I guess we're near the top of the hour. Okay, let's see, otherwise, any comments, Jesse?
Yeah, okay, yeah. So he's at IIT Delhi, is that correct? And they're working on the self-driving car project. Self-driving cars are interesting, because they can figure out some of the aspects of self-driving, but they still can't always figure out whether there are people in the crosswalk at certain times. So that's interesting; I hope you guys find some success. And then Jesse says: "I have things, but I'll say them on Slack, harder to say here: interesting influences of trajectories and structures that support capacities for development."
Good talk, thanks. Yes, like I said, there are a lot of connections to things like feedback and genetic algorithms and other things that aren't immediately apparent in reinforcement learning, but we can talk about those as well. So yeah, thanks. Okay, well, we're at the top of the hour, and so when we... okay, what about an algorithm for avoiding self-driving cars? Oh yeah, you dropped out when we were talking about that.
So I said that they've made some pretty good advances in self-driving cars, but sometimes the self-driving cars can't identify whether there's a pedestrian in the crosswalk. So there's a problem with that, and it's something that needs to be solved, obviously, before we can really put self-driving cars on the road in large numbers. Apparently in San Francisco they have a lot of self-driving cars, and you get reports of people getting into accidents and things like that. So it's a problem, but it's something
A
Of
course
it
needs
to
be
solved
before
widespread
adoption.
So
all
right
so
well
thanks
for
attending
everyone,
and
if
you
need
to
contact
me
over
the
course
of
the
week,
send
me
a
slack
message
or
email
and
next
week
we'll
try
to
move
the
meeting
to
Monday
morning
instead
of
Wednesday
and
I'll,
send
an
email
out
about
that
all
right.
Well,
thank
you.
Everyone
have
a
good
week,
see
you
later.