Numenta Live Streams, 21 Jun 2019

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Re-Work Deep RL / Applied AI Summit Day 1 Recap

Description

https://www.re-work.co/events/ -- Watch live at https://www.twitch.tv/rhyolight_

A

Okay, so I have been at the AI Senate, so this is a conference from rework company that Australian company, so I attended mostly deep reinforcement, learning talks, I think.

B

I went to a couple of eight applied AI summit talks and.

A

I took notes throughout so I'm going to show you the notes, I'm going.

B

To go through, if you want to see the schedule, I would go to the.

A

Deep reinforcement learning one because that's that's.

B

A

I that most of my notes are.

B

A

If you want to follow along so.

B

The first one is secure, deep reinforcement.

A

Learning and I took notes about each one of these, and forgive me I'm gonna, try and keep data hey depth where thanks for joining, let me make sure that's this.

B

Being displayed properly, yeah, that's good enough is.

A

B

A

You can't see all of my notes: okay,.

B

A

Talk, oh man. This is a whirlwind, but this has been a long day, I'm ready to go reward myself with some.

B

A

B

I'm gonna go.

A

With some Pakistani food, what.

B

A

So let's talk about the first talk.

B

I went to it was I'm gonna blunt yourself about deep reinforcement.

A

Learning I didn't.

B

Know a ton about the reinforcement, learning I know conceptually.

A

B

I learned more today, so this talk.

A

Was by a professor at UC Berkeley, it's called security, deep reinforcement, learning and.

B

She was talking about ways.

A

To attack reinforcement learning systems.

B

Most most read.

A

Deep reinforcement, learning systems aren't even deployed in production today.

B

A

A big research topic so.

B

We're already talking about.

A

Attack them, she talked about three different sort of ways. You might attack the system, one being integrity, which is try and prevent the system from doing what it's supposed to be doing. This firm like by like acting with it like doing certain things with it that exploits it and keeps it from doing what its what it should be doing and then the other is or or to do something. The attacker once tell Jeremy nice to see you so the the.

B

Next one was confidentiality and privacy which.

A

I thought was interesting: I hadn't quite thought about that before which.

B

Has a way you can learn sensitive information and then the third one was really attack. It was just like a statement like.

A

B

Organism generalization and.

A

Should be in couple points about that, so the first.

B

A

That she talked about with regards to integrity was just talking about the classic adversarial image classification examples.

B

Like she showed the.

A

South side and how you can place stickers on the stop sign and, and it will misclassify.

B

Added some other type of.

A

Roads like a speed, limit, sign or something like that, and.

B

She's saying the same thing so same adversarial.

A

Attacks can apply to reinforcement partner because a lot of these reinforcement.

B

Learning systems are at their core.

A

Analyzing images.

B

So you can, you can use the same types of.

A

Adversarial image, attacks on reinforcement systems even.

B

And she used some of the game examples.

A

Of a pong game and just by adding really small advert where she called adversarial perturbations, really.

B

Small perturbations that I couldn't even see right, but there's ways to just change a few. Some of the pixels.

A

B

And it just locks the system up.

A

B

Because the system is so sensitive to the specific environment of the game.

A

So if you start, if you can add a little bit of noise, a little bit of fuzz or.

B

Just you know pop a pop up something here: they're just randomly.

A

At certain point, really screw up the system and it'll get locked into a state where it can no longer perform I.

B

Thought that was interesting interesting. You call that adversarial probation.

A

B

You don't even.

A

Need to do it to all of the frames it's a being or each of you.

B

Don't have to have that consistent through the state under the system that it's that it's receiving you be just injected in some of the frames, and that was enough to lock the agent into a seat like that. It can't recover from so that was interesting and it didn't seem very hard.

A

B

If, like, if you're deploying system like this and the public has any control over the states that the system is observing.

A

That could be problematic so.

B

Another really interesting thing.

A

Was about confidentiality.

B

And privacy, so in this remind you didn't say, spoofing, but this.

A

Is sweeping so.

B

If you, if you and this reminded.

A

Me of like how Google autocompletes, so if.

B

A

Yesterday, yesterday I went to there.

B

You go, I was a school in Spanish, I, work.

A

And I went to work in Spanish, I went to the store or whatever it's.

B

Do do this autocomplete stuff.

A

B

There's there's a model behind this that has tons of information about all the things.

A

That people have typed in if you train the model on sensitive.

B

Data, you can say something like.

A

Going out on a limb here, but Matt where's, social.

B

A

Number is and then autocomplete that gave my prediction: what's going to be opposite, Google's not going to do this, but that's.

B

An example of a way that you might you.

A

B

A

B

Sensitive information.

A

And ways of doing that may not be evidence.

B

That was pretty interesting.

A

B

Way to get sensitive information, she gave an example of a.

A

Reinforcement learning housecleaning robot, you know that knows.

B

How to navigate your house as learn the environment of.

A

Your house, so if you take that model that that reinforcement.

B

Learning agent has builds about the environment.

A

B

Can infer structure about that environment.

A

Even when the environment isn't around so like, if you took that mob- and you put it into like an empty house and you.

B

Told it to navigate to the bedroom, it would still go to the bedroom, and so then you'd know where the bedroom was. You know what I mean so so that was interesting. You don't need the environment to extract information about.

A

The environment she.

B

A

About ways that you can infer the.

B

Dynamics of this is, unlike the friction coefficients of the floor for.

A

B

She seems it seems like you're able to not only understand.

A

The layout of the environment, but.

B

Also, some of the dynamics of the environment for whether.

A

It was carpeted or wood floors or laminate tile or some.

B

Interesting to.

A

B

Comments about generalization were.

A

Basically, stating that the green words are over this and it's.

B

Brittle to minor environment changes, so it's not. It doesn't generalize.

A

Very well, and it.

B

Seems to be a theme.

A

In this Congress that this.

B

Is well understood by researchers.

A

And they are trying to create benchmarks to.

B

Identify define.

A

Some type of evaluation metrics to you.

B

Know benchmark to.

A

Help benchmark.

B

Generalization capabilities, so this is internalization. There's a lot there's a theme of generalization you'll see.

A

So and I don't quite.

B

Valuable, if you trade on a default environment and.

A

B

Can test on that environment and then you can interpolate? Maybe you want to train on some random environment. Test on random are great, but.

A

You can't train.

B

On an epoch learn another environment that doesn't transfer easily now they're working on methods to do this, but it's all fiction acts.

A

B

The these are extreme environments and half they're, just really bad. On.

A

B

So this these models don't transfer over from the barn environment. You.

A

Know I sort of knew but I.

B

Take away it was, nothing is good yeah, like none of the there's. None of these.

A

Generalization, that's just the way it is I. Also.

B

Know there's something called AI retro, whether.

A

B

Was an LP but learning to act.

A

By learning to describe- and it's.

B

Sort of, like you know, P+.

A

Applications and from.

B

Learning the different.

A

Packs like how how you can use reinforcement wearing to help understand language, because they do have.

B

Was you can create these different models for a speaker.

A

Both different agents, that's generating text well,.

B

That's listening protection and then.

A

Trying respond with a fist and one.

B

Of the themes was trying to treat language is a form of game play.

A

B

The model the server to classify, uses that Melissa model as a reward function procedure.

A

For an interesting, the.

B

Takeaway there was language is Burke.

A

Game theory: the.

B

Next question was working with you: Berman LP I, don't know any.

B

On to the next one, this is 5, so this was Google.

A

Brain called rewards.

B

Resets exploration, bottlenecks.

A

Are scaling the RL and robots so.

B

There's another theme and instead of.

A

The reinforcement factors, the diversity, the.

B

Diversity seems to be really.

A

Important to reinforcement, learning agents and.

B

That sort of, like the diversity between train between agents, when you're training and.

A

B

Sort of see diversity.

A

This is the case, but diversity.

B

A

B

Like learning of skills,.

A

A

B

Lot of this hammer is trying.

A

To minimal minimize the effort, the.

B

Human effort to teach these these skills.

A

A

B

A

Skills to the robots, one.

B

Of the interesting things, learning was analogous to learning.

A

B

Like you've got a little agent.

A

New to do he's walking through the environment and he gets caught up in like in a hole and it just stuck right, you're just stuck in the hole and and what do you do you reset like? You have to basically abandon your state and reset to an initial state, so you can continue, and this wasn't when you reset like this- it's not continuous learning right, oh I'm. Sorry, thanks for I can fix that. I can fix that how's.

A

That is, that better I should have all right, I fixed it before you guys, even noticed, okay cool. So this reset free learning there same as continuous learning, but it didn't seem like continuous learning to me at least not like what I would say is online learning. Some people call continuous learning the same as online learning you still the the reinforcement. None of these reinforcement learning systems were necessarily online.

A

It wasn't. The models had to be retrained over. You know over and over new environment, new new features of the environment, retrained rangering and it's a massive amount of training. You know to get to a point where you have a model that can can run through a test environment again, but there was some effort to try and mitigate these resets, and this was mentioned in a couple of the presentations exploration was defined as a robot moving objects and getting some updated image caption.

A

It's not the way, I define exploration, because I feel when it when I talk about exploration, I, talk about I, think about online learning because to explore and and when they talk about exploration, they're talking about training, a system to explore, not not like inherently exploring and updating its model immediately. None of these things that they their model immediately when they, when they explore it's all like a policy. It's all, that's all some of the policy has like exploration baked into it because it was trained to do so. I didn't quite understand this.

A

There's this there's a paper about this Leave No Trace thing I'm trying to get the you think of online learning is learning while also outputting yeah, so they're online, yeah, I, guess so yeah, while continuous learning to be an offline system. That's only learning, but not now putting anything yet that could be all of these reinforcement. Learning systems are outputting like they're taking actions they're moving through an environment.

A

I would say that they're they're, the I, went to all these deep reinforcement learning talks because because I'm interested in the movement aspect of of them, because they have this loop, where you take an action and then the state changes right, you take an action and the environment changes. So that is familiar to me because that's you know the the brain does the same thing. It.

B

A

Action and then and then the sensory input is updated. When you move, you see yourself move you see the environment change so I like that sort of similarity to HTM. So, okay, so the Leave No Trace thing I didn't quite get. But if you know what a cue function is and in reinforcement learning or like cue learning, it's like this huge cue cue as in like a big buffer or something of all all of the previous states and actions and states that you've had so you can sort of look up the best. The best thing.

A

It's a cheat honestly. It feels like a hack to me. The whole cue function, things I have but they're using the cue function to learn probability of a bad state and then go back to initial state I. Think that was a way to avoid these resets right. So so before you get to a place where you need to reset we're. Resetting would be expensive in a way because you'd have, to you know, cut clear, cut off your progress and go back to an industrial state like recognizing a bad state and then backtracking.

A

Somehow I did not even quite understand that honestly, so I did take away that it's difficult. This idea of exploration isn't is difficult. They I think they talked about this word, empowerment, which I think was just a term they used in one of their papers, which is like the ability to predictably change the future state of the world like take an action that changes the state of the world and know that when you do something you're going to change the world and they can.

A

This seems like some sort of an aspect of a policy that adds exploration to it if you're empowered, it's sort of the next a way to explore the state space. The summary talked about, rewards, resets and exploration. Okay, next talk, so this was a deep dive. This was an introduction to deep reinforcement, learning which I thought I could use. So this is sort of just the basics of reinforcement, learning by Joshua I, don't know how to say his last name and open AI.

A

So this is just probably some basic stuff that I'll go over that because I'm still learning about this stuff, too reinforcement learning is useful when evaluating behaviors is easier than generating them, so reinforcement learning it seems like the trend. Is you have an action space right? You're? You have to define that action, space ahead of time or whatever task whatever it is. You want your agent to do in whatever environment it is. You have to define this action space. Somehow you you're, not. It doesn't automatically generate new actions.

A

It's you have a space of actions that you can take now. This could be a very large action space, but that really increases the amount of computation it's necessary.

A

So but but none of this stuff is generating actions you're not creating ways to new ways to interact with the environment. You're you're, you are. You have like a library of actions that you're selecting write a policy. You know it's a reinforcement, learning policy which they denote as PI and all of the equations is I.

A

I honestly do like I sort of know what it is in my head, but I, don't I, don't know it well enough to define it well, but it's, but it's I would just have to google it honestly, okay, so I'm trying to I'm trying to decode some of the terms they were using when they say stochastic.

A

They mean that it's a probably probability, distribution of actions that you get returned that you select from versus a deterministic system or deterministic agent, where the policy maps to an action so I think the policy is like, given the state, what action am I going to take, and so you can have a stochastic policy that gives you a probability, distribution of actions and you look that you might want to take and a deterministic one.

A

There is also a term they used called trajectory, which is a sequence of states, actions which is I, think you know what gets stored in the Q and the Q learning, but they call this thing, a trajectory which is just like it, just a sequence of states like in order, and they also call it an episode or roll out. These are all new terms for me: the reward function I generally new. It's like it's. Basically, it's a function. You run to give you. How good is your stay?

A

Are your state action pairs and this can be for like right now it can be for this action or it can sort of be a measure of cumulative reward which they call a return.

A

Okay, so in reinforcement learning we want a policy that maximizes the expected return so you're. Actually you want your reward to be high. Essentially a value function will tell you something about that expected return after a state or a state action pair.

A

So how how valuable was that it wasn't totally I didn't totally understand that I like this this this graph here, it's sort of I know it's not super easy to see, but there we go so breaking up the reinforcement, algorithms into model free and model-based reinforcement, learning, here's Q learning, which I talked about a lot of the interesting things. It seems is happening in this deep, deep Q, our DQ ends or deep Q learning networks and then there's all these other things. A lot of them think they called PPOs.

A

That was really I, don't know, even though that stands for something about policy, I'm sure optimization, some type of policy optimization and then the model based stuff was they didn't talk nearly as much about I, don't I think that's harder. It seemed to me, like the model based stuff was harder to figure out so most of the work that seems.

B

A

Be is in model, free reinforcement, learning, but essentially you've got this cycle. You you, you run the policy, you evaluate the policy and then you improve the policy and rinse and repeat over and over and over again that's sort of the the.

B

A

Of reinforcement, learning the value function is an approximation. Value function is actually some other neural network and it gives you some approximate value for your actions.

A

So this is a direct quote from Joshua learning. Models is a really hard problem and I like to point that out, because that's where I think the big opportunity is for HTM anis, with association with reinforcement learning is how can we use HTM models, along with reinforcement, learning, because I think we have a good mechanism for creating models as sensory motor integration evolves and we and we have a way to generate movements.

A

Order at least become have some cause causality in in the movement space, and this is one of the problems, because reinforcement learning isn't good at generating movements. You have to have like this action space. So I don't know how to how to resolve this, but so that's one one take away. I didn't know that was the case.

A

You need to do predefined your action space and with HTM I think we might be able to help without if we have a predefined action space and we can contribute to some how we can have get a hook into that value function of the policy or something so that we can contribute to that action that policy to decide what action that's going to take so that the model can then update based on whatever happened.

A

Maybe that's an opportunity for us also reward function. Design is apparently very hard, and a big takeaway is that this stuff is route still really new. I mean this guy's from open, AI and he's saying that most of these deep reinforcement, learning implementations first of all, there's not very many of them and they're all tuned for research. So, if you're going to try and deploy any of this stuff in production like you're, basically, it's agree.

A

It's you're on your own you're, probably going to need to put a lot of research into it and and figure out some hard problems yourself also tuning. The hyper parameters is also very hard, so that was those are my takeaways from that.

A

Okay, um this is from Google brain Raw, Google representation in Google brain there's a deep mind. There was so learning abstractions with hierarchical reinforcement. Learning. This is sort of interesting um now what they mean by hierarchy. Is that say you have an agent and it's got legs. You know so. First of all, the reinforcement learning actions space for that agents.

A

Locomotion involves things like move, this joint 30 degrees out or move this other joint 20 degrees in you know, that's the sort of action space granularity, we're talking about and the lower level of the hierarchy now they're talking about another hierarchical level, that's more concerned with navigation through a bigger space. So say you have a maze and you know you need to get from point A around some obstacles to point B.

A

First of all, you've got to deal with locomotion, that's sort of the lower-level parts of the hierarchy and the high level of need is like okay, well, I need to move right. You know that I need to move up and then I need to go left right.

A

So that's what they're talking about when they're talking about hierarchy, high level versus low level, so some of the things I noted was that these high level, this high level, excuse me high level part of it operates at a lower frequency. So it's easier to learn in the space and exploration is easier because you don't have to worry about the details of locomotion. If you just off, you know to talk about you put that off to the lower level of the hierarchy, then trip with foxes.

B

A

Understand at the reinforcement, learning, predefined rewards and action spaces has always been the biggest problem. The space has become infinitely large and we're trying to hard code this stuff, essentially trying to account for anything and everything that could happen ahead of time yeah, and you can't do that right. I mean your environment is going to change in ways that you will not anticipate and I. Think we I think we understand that okay, so so one of the things he said that I didn't agree with was he said we humans don't explore by just flailing.

A

You know and he's talking about having a sort of a high-level policy that that does navigation versus the low-level policy. But that's not quite true, like we learn how to explore by flailing what reviews, babies flail and that's how they learn how their limbs locomote, you know, move through space, so there were a few quotes about the brain and analogies to human development and stuff that were we're. A little bit off. I would say versed on how I think about the brain.

A

High-Level abstractions must be created, so so this is so. You have the hard-cooked. This stuff, it's not it doesn't just learn. High-Level abstractions Jeremy, says haha should say you should hear me trying to learn a Bach Prelude yeah. Definitely some flailing any, and it's true. You know you do flail. You do flail when I'm learning the guitar one of the things that they tell you to do is meander and and meandering is just exploring it's just like what happens. If I do this, you know cuz you already when you do that.

A

You've learned that's sort of the flail like I, don't know. What's gonna sound good until I, try it I'm.

B

A

Trying to code my first proper machine learning, implementation.

A

Okay, so so the big decision in designing this system with a high-level abstraction is what are the abstractions. If you're talking about navigation through a maze, you might decide, move right, move left or just Northeast South turn some stuff like that could be the abstractions and then the low-level is just the locomotion required to execute that movement.

A

You know the high-level movement, but every system is going to be different if you're, if you're modeling a hand or something that, with a completely different navigation system, you might have totally different high-level abstractions like, for example, open and close your grip or and for a hand. There's a there's like a huge array of different grips that you can do with your hand. You can look it up. Like google, google hand grips, there's a ton of them and that's just from you know, inspecting humans and and how they operate with different tools.

A

There's a there's, a lot of different groups you might use. So all those high level abstractions have to be hand coded essentially for for whatever system you're creating Google conditioned hierarchical, reinforcement, learning I, don't remember! Writing this low-level goals. Just need to figure out how to accomplish high-level steps, yeah and I. Don't remember why I wrote that so I think some of the some of the open questions is: how do we learn efficiently?

A

We they're saying that these high-level policies must be modular like move left. It must be something you can apply wherever you're at in the space that can't be dependent upon some some state I think so the hierarchical or the high-level training depends on the ability of the low-level policy. So there you can't, you can't have high-level goals that the low level policies can't achieve, for example like if you want to move up some stairs the low the there has to be a low level policy. That knows how to climb stairs right.

A

High level. Training must be on policy. So I'm still trying to understand this idea of on policy versus off policy. I wrote a definition of it later so we'll get to that. So I'm not going to go into that right now, but you can't go off policy for the high level training because it's too inefficient and it's unfeasible in the real world and they did say a little bit about off policy Corrections. But it just seemed really happy to me so I just say it was a hack.

A

All right. So next up is is Jeff Clun Jeff Clun works for uber AI Labs, and he talked about something called poet which.

A

It's hard to explain he's got a whole, really long presentation at ICML, 2018 online. You can go look up that talks about map elites and poet, but this is basically all about goal. Switching- and this seems really inspired is this is all about diversity.

A

Once again, it calls equality, diversity, QD, algorithms and- and these really have like an evolutionary flavor to me but the, but it seems like what it is, is a way of jumping around some space of solutions like not just looking looking at one specific area but being able to jump from from one's from the search from one location in the search base of solutions to another location. And this this is not. This did not seem very brain-based to me, but it was more about like the evolution of organisms of aura of cultures.

A

Even so, he relates us to use the term called adaptive radiations like, for example, there's different types of fish in different ponds in Africa, and they all have adapted specifically for their own environments, but they all came from a common ancestor, you know so like, but they but they've all become very efficient and there's different of different areas. The computer was another example that we used of adaptive radiations. You know we start with one basic type, not basic but like very specific types of figures, but now there's all different types of computers. Doing specific things.

A

Excuse me, you talked about some paper or some something called go explore. It's solved, Montezuma's Revenge, which is a really and they use this idea of quality diversity. Algorithms.

A

These are all open-ended algorithms, meaning that they will continue to improve as long as they have more things to train on. When we talk about alpha star, that's all we'll talk about open-ended algorithms as well, so this poet, what was the poet? What did it stand for? I forgot paired open-ended trailblazer, so this is some framework that he and some colleagues created that periodically generates new environments and then it optimizes on it optimizes in one environment and then we'll like systematically generate new environments and I.

A

Don't think these are completely generated from scratch like there's some hard coding in there, and then it will transfer its learning, so it actually transfer weights from what it learned in one environment to the next environment. So taking with it. Like here's. Here's an example: here's like you've got a little agent he's walking he's learned to walk across a flat space. Okay, so let's take that transfer it to a new environment. The new environments got flat space, but it's also got these little stumps.

A

So so the the agent now knows how to how to navigate through flat space. But now it's got to learn more about the stump. So so it's it's sort of a way to separate learning about different aspects of different environments. You might have another environment, that's that's, got rocky terrain, another one! That's got pitfalls or things you don't want to fall into, and so you learn all about one environment. That's sort of the idea, the idea behind this. You learn all about one environment. You transfer it to another.

A

You learn about that when you transfer to another and then you can jump sort of back and forth and try different environments and try and take these these. This knowledge transfer- and you do this like all in parallel and try and find like the sweet spots in the search space for your agent. So that was interesting. You had a lot of good graphics. You should try. Look it up. Look up, Jeff clone map elites or poet, and there's he's got talks online about like long talks.

A

He only gave a 25 minute talk, so I didn't get much. Okay, another Google brain talk, understanding how value predictions shaped deep representations. So this was something about dis, dis, just distributional, reinforcement, learning and auxilary tasks.

A

I did not understand this talk, so I tried, I, don't understand these drawings that I made I was just trying to follow them. I did not understand what this. What this was all about, so fail big fail. This was interesting just because it's Starcraft, you know I like to start craft so from deep mind alpha star, which is mastering the realtime strategy game, Starcraft 2.

A

So some of the challenges that they said about this is in StarCraft. This is a complicated game, so I mean this is impressive. This is really impressive. There's a hidden information in StarCraft because you only get to see what's around your troops, the rest of the map is clouded and you won't. You only see the map as you move things through it and their explore, and there's this huge action space, so they've defined ten to the eight action action space, as you can do so many things.

A

So here's sort of the architecture if you're interested in the Alpha star architecture we've got the core of it- is a deep Ellis TM system, but they've got all these other deep networks, maybe not deep, but at least neural networks, there's a resonant here as a feed-forward net and then transformers, and so this is so those are. This is highly tuned to Starcraft by the way.

A

So like this isn't this would not be easy to transfer to any other game, even something like Warcraft or even maybe Starcraft one I, don't think you'd be I'm sure you wouldn't be able to transfer it. But this you have these ideas of spatial observations, economy, observations because you're, always building things and you've got materials that you're trying to optimize units that you're bill, but at the core of it is a deep Ellis TM system and, and what comes out of that is move our attack or mine.

A

You know things like that, so that was interesting. So what they did is they train again. Diversity is important in this. They created a single alpha, star, League and so they'd start with human game definitions right that they got from Blizzard yeah, so I'll get to that I.

A

The action space I, don't know how they define it, but it's totally hard-coded for Starcraft 2, so it would be like given yeah I, don't know because you can, you can select any unit and move them to any place or to give them at an action in any spatial location. I have no idea, but it's I guarantee you it's highly tuned to Starcraft.

A

So the way they started training these- and this is a massive amount of training they. Obviously, if you don't know this, you know alva start be the best Starcraft 2 gamers in the world over and over over then 10 times out of 10. So it was a big win for AI, but they started by getting human replays from Blizzard.

A

So Blizzard had some information about humans playing the game, and so they initially trained on humans playing the game and that's how they got sort of their seed for for alpha star to play on, and once they got some agents that were trained on human players. Then they would start and they would create new agents and they would train them to beat those agents right and then they create diverse representations of those agents and every agent they make the goal would be to beat all the previous agents.

A

So it was a ton of agents and they would encourage diversity which they said was crucial. They had to do this if they did encourage diversity, I, don't think what it worked so the way they did this was they would give the different agents slightly different goals. Okay, so so like for one agent its goal, it would be rewarded for beating all of the other agents in the league for some of the agents.

A

They would reward it just for beating one or another, like particular agent, because it would develop specific strategies just to beat that one agent. It wouldn't attempt generalize its strategy across the whole space of agents, and that would inject some diversity into the training environment. They would also reward some of their agents for building different types of units, so they would, they would hard-code some reward and make some that would would get more reward throughout the game for for building particular types of units or mining, particular types of resources and stuff, like that.

A

So again, hard-coding things for sure. I do not know what the Nash strategy is. I tried to follow that, but I didn't I didn't, but it's some type of probability, distribution over all of the agents. That's optimal. The interesting thing is to be for alpha start to beat these human grandmasters. They trained over 600 agents. Right from that, in that scheme of everyone has to beat all the other agents, and then they create a new version of it.

A

It has to beat the other agents each one of these agents went through more than a thousand years of in-game training. That's six hundred thousand years of training, which blows my mind. You know how much compute power is that that's crazy, but that's what it took to beat these to grant these grandmasters.

A

That's what it takes to reach human level capability in StarCraft to six hundred thousand years of playing against yourself and alphago is the same way, a massive amount of training to get these and where you're talking about these statistical probabilities, because all of this is just probabilities and and trying to learn these probabilities by seeing so many different examples of things.

A

Okay, so the each one of these agents, iteratively learned from all the previous versions, and so this was sort of interesting as they watch the evolution of these agents. They saw that initially the agents would would expand their bases and that would win for a little bit, but then the next sort of generation of agents. Some would be more aggressive, because they've learned that being aggressive, you could take over all those bases, so being aggressive was then rewarded right.

A

So then, the next generation of agents were rewarded for being defensive because they were being attacked all the time. So so you go through sort of these evolutions of strategies and to the end where they're you know at some point after you've developed your your you're getting defensive, you realize well now I need to go scout and see where I'm going to be attacked from and by what so there's sort of this evolution of strategy over time, as these agents are constantly fighting each other and trying to come to the best solution.

A

But some interesting thing is things is that alpha start did things that were really interesting to the players that it was playing like it would hide units it would.

A

It would put a unit close to an enemy base, but just out of like I reach, so it couldn't see that it was there, and that was surprising to one of the grandmasters and it also it learned how to Scout, basically by evolving from just expanding bases, to being more aggressive and attacking to then being more defensive and then realizing well, I have to go Scout if I'm defenses that's more rewarding to to scout.

A

So the big takeaway from this was you should encourage diversity in your reinforcement, learning agents by allowing them to have different goals. Okay, there's a long day, okay into the afternoon, injecting structure. This was an Nvidia talk, injecting structure for generalization again generalization in robot manipulation. So this guy did his examples. He gave like a video. You remember, Rosie the robot from the Jetsons. If you're my age, you probably, but it was the house cleaning robot, basically with attitude, she had a lot of attitude but Sam's attitude.

A

He wants to build robots that are able to basically do lots of different things in unstructured environments. How to so the big. The big question is: how do we generalize in all these unstructured environments? I'm, not sure I took a lot from this. This is a weird I took some weird notes, because I was planning on talking about control because he had this library much control to planning the perception, but then that didn't go anywhere. So so I just made notes about visual motor skills. Diversity of skills.

A

Can we build representations that can transfer to similar tasks? Questions he's asking he's talking about sensor, fusion and and the need for it to create a general representation sensor, fusion, meaning you might have torque information from a robot alarm. You might have a camera information and one of the nice things that this adds some robustness, because then you can interfere with the camera and it can still do some things because it has has other Infirmary information coming in.

A

They always seem to start not always, but he he mentioned that, starting with some random policy.

A

Representation transfers between track tasks- that's a challenge- I mean I. Really haven't nobody's really solved this yet, but you want to be able to do, is learn from when you're vacuuming and and how those actions can be applied to other tasks. Other goals that you have but and currently all these policies need to be relearned when you're, when you're jumping between tasks, model-based tasks, space control, I, didn't know what that meant.

A

So I wrote it down the main takeaway from this, for me, was action, representations and self self supervision provide structure and- and he had more on that slide, but he switched so fast. I couldn't write it down. This guy went really fast through his slide, so it was hard for me to take in the whole thing hello mark brown I'm from I'm more than halfway through my recap of my my conference day, I learned a lot about reinforcement, learning and boy. Are my arms, tired, quantifying generalization in deep reinforcement? Learning again generalization was an open.

A

A I talked lots of talk about generalization, it's a big problem in reinforcement, learning and nobody's got good solutions, quick they're. All admitting like we need better benchmarks for generalization. We need to figure out how to how to how to do generalization.

A

This open, AI has some system has some platform called coin run, which is a game platform and what it does is generates an infinite amount of levels for training which is which is beneficial because it because you can help train generalizing. You can help too. It gives you an environment where you're forced to generalize right sort of, but unlike sonic I mean because that when they did the sonic the hedgehog thing at an open ai, they only had like 50 levels. So you can only do so much with 50 levels.

A

Red fox says relearning for different tasks, a common theme, yeah deep learning is do incredible things, but we're not really any closer to AGI. I, agree, I, agree. Okay, large training sets are better, obviously for deep learning. Large training sets are better deep. Architectures are generalized, better yeah file that under duck agents can over fit to a large number of specific environments, so I nothing mind-blowing. Out of this talk: okay, okay: here we go into off policy versus on policy. So here's where I wrote down I, don't quite understand on policy versus off policy.

A

Yet I'm gonna need a night to sleep on it, I think so. This is called off. Policy reinforcement, learning for real-world, robots from Google brain on policy means you can only train on data from your one agent from one agent or current agent, and that data is not reusable for new environments. So that mean I. Think that means that policy is tied to an environment and an agent. So when they say off, policy they're talking I think they're talking about learning transfers, something like that.

A

But you can train one agent on another agents, experience and possibly combine data from multiple agents experience into another policy, so it lets you run or lets us train reinforcement, learning models without robots in the training loop, this guy on Google brain they've got this thing: what they call it the arm farm, they call it the arm farm at Google, and it's just like a bunch of robot arms I saw this in a couple different presentations where there's just a bunch of art, robot, arms and they're, all like just reaching and grasping things.

A

You know all they're, basically, all training and they're and they're all collecting this data so that they can create reinforcement, learning policies off of it. And if you do this off policy thing, it lets you train these reinforcement, learning models without having robots in the training loop, which is great because robots in the training loop are expensive.

A

So on policy is good for specific environments like Amazon warehouse, I, guess as long as the environment doesn't change it's as long as the agent doesn't changing the environment. Doesn't change. There's a lot of a lot of talk about q-learning this. This guy talked about a specific type of q-learning called QT ops, which is so they were using on butt. So they would.

A

They would train they'd have an off policy system, but they would use on policy to fine-tune things and that would get their accuracy from like 85% to like 95 percent or something so they would. They would only use a few robots. They need much less robots, essentially to train a reinforcement, learning system to do something. So this seems like optimizations that will law on policy for soft policy thing, improving QT apps to use less real robot data using a simulation and trying to transfer that learning world.

A

So the difficulty and off policy evaluation is that old agent behavior does not equal current agent behavior. That was my takeaway. Is anybody? Remember Zork big king I'm, looking at you, so this was about reinforcement, learning in interactive fiction game, so I thought it was interesting because I used to play there was a Hitchhiker's Guide to the galaxy text-based game. That was my first experience with text-based games with with and I thought was really fun. So this you talked about Zoran, which is an entirely text-based control and state sort of game yeah. But it's like.

A

Basically, the game is a text thing and it says you wake up in a field. There's a white house to the west and there's a mailbox in front of you. Then it's just like I've got a cursor and then you have to type something like open, mailbox or move west, or you know sleep or something like that right.

A

B

A

Yeah so the idea, the motivation behind this one of them anyways, is to better process voice commands because, typically, you can only give like one command at a time, and these text-based processing systems are really brittle, like you can't say, do that you can't give it a complicated sentence and expect it to understand it.

A

Current voice assistants are not reinforcement, learning. That was one thing he said so when you're talking to Alexa or Syria or whatever. That's not, reinforcement, learning, that's just deep, deep neural networks, because apparently they're too costly to train and they still need to study how they work.

A

Hello, hello, maverick watching, spider-man, okay, sorry and on subscribe, and you won't have to chop down door, never worked yeah. So you know I'm talking about you played that you played Hitchhiker's Guide. That was great. That was when I was a really intriguing game, because I played that game. Before I read Hitchhiker's Guide to the galaxy, it sounds really young, and so it was even more intriguing because I did not know the storyline or anything is the know. The mics not turned off you guys.

A

Could all hear me right, yeah I, don't know, Mike turned off I, don't know. What's going on with your with your audio okay?

A

So so, when you're dealing with text, one of the things is it's a huge action space, because when you are dealing with like a game like pong or whatever at our games or whatever or even PlayStation, you have to be a buttons and there's only certain combinations, there's a very finite combination of buttons and actions that you can take with that keyboard, pedal with that button pad that keypad. But when you're dealing with text, it's a huge action space because you're entering phrases not buttons.

A

So that's one of the big challenges with reinforcement, learning when you're dealing with text I, don't know why that you can't hear. But it doesn't matter because you can hear me say that so.

A

Let's see must return a history of must retain a history of states, that's one of the things like restrictions, so he had this general game playing agent that he constructed at Microsoft called nail. If you want to look it up, it doesn't perform very well on any one game, but it performs decent like novice level across 20 different text-based games, so that was interesting right.

A

Okay, there's. He also described a couple of other algorithms, aside from that that were text-based, a star search which, which is has the most handicapped handicap being you're, giving it a lot of a priori information like all of its actions, were predefined, specifically for Zork right. So in order to get- and this did really well because it was finely tuned to Zork- it's not generalizable at all, and it also had this ability to travel through time, which means which is like a replay.

A

You know, if you get to a point where you fail at the game or you die, you can just step back step back step back and then retry right, so that this has a the ability to do something like that. It baked in to this system the next thing and I heard this idea of this actor critic model and reinforcement, learning I, don't know what it is, but they call this A to C or advantage actor critic.

A

Single policy, but multiple parallel environments right so you're running a bunch at the same time, using the same policy again it has a fixed action set. So this was again specifically tuned towards each game, but there's no time travel now the nail was: it was his thing: I think it's open source and you can look it up.

A

Microsoft search for Microsoft nail and navigate acquired, interact, learn so, first of all, there's lots of hand coding of things and there's hard coded external memory, but it's but it's focused on text-based games, so you can play 20 different text games on it, so it's not walk the MS or it's just locked into text-based games and it performs pretty poorly across all the games, but it can at least perform on the games.

A

Okay, next talk the hidden risk of blind, a a blind trust in a black box. I talked to this guy. Aj I didn't really get anything from his from his talk. You just talked about what they call explainable AI.

A

Nothing too interesting. There I talked to him about what we do in the Mensa cuz. He has talked about things about the brain and he was super interested in what we do, but I did much from his talk. Okay, what are checkpoints of yeah check checkpoints, meaning like time, travel, meaning that you can a checkpoint? Any any action you take is a checkpoint and you can go back to that state and try again, you know, like think choose-your-own-adventure.

A

Every time you die you're just like well I'll go back to the page. It was before and try something new yeah with what Mark said: okay, so from word embeddings of pre-trained models. This is a talk from Amazon. This was just sort of a recap of recent history of NLP developments. Word of betting's, we talked about word to back a lot of fast text.

A

I mean just look up word embeddings and these technologies. One thing I noted was: you talked about this thing called Elmo, which was, and she referenced a blog post by mihail Eric, who used to be an intern in dementia, but he currently works for Amazon and for Alexa, and there was this other thing that I called Burt that I guess Google has created by directional encoder representations from transformers it's another type of language processing. You know P type of thing: I, don't I, didn't I, don't know much about it.

A

You can go too far into it, but I saw someone else to talk about it. One of the vendors was talking about how they could use Burt, but it's it's based on word vectors. It's basically word vectors that, because word vectors word embeddings, excuse me. They don't give you a good context of words so like when you say: I eat an apple or I use an Apple computer. It encodes those word embeddings the same way, but Burt apparently has different betting's for different contexts.

A

Quick, sorry, man, Alexa, destroy, mark brown, look at work, we'll, never know. What's sorry, I'll stop I'll stop! That's it! That's it. That was the end of okay. So so, if you a little recap, let me turn this to there we go now. You can look out my screen.

A

Sorry I'm, not sure, of course, I'm not sure what you mean, how wait wait? I, don't know if this is gonna work, oh I, don't have. That sounds weird. There is one. This is sorry Dave, that's right! Sorry, Dave, oh you're, not gonna, hear it. I don't have it set up on my laptop anyway.

A

Okay, so the the talks were informative, I think I, know, I, understand more about reinforcement. Learning now, which is great, so I, feel more informed. It was an exhausting day honestly I got up at like 5 a.m. and drove up to San Francisco.

A

The the conversations I had were interesting because anybody that I talked to and I probably talked to four, let's see one two: three: four: five: six ten, maybe ten, ten twelve different people like in depth about what we do at noventa and and how the brain is the way towards AGI and how important it is to understand the brain. None of the stuff that I saw today was is truly like brain inspired in the way that we talk about being brain inspired and when I did talk to people about what we do here.

A

People were excited about it. They wanted to know any more information. I probably gave out 20 different business cards. So that's good. That's good I mean I. Think people realize that perhaps this is something we should be paying attention to am I still streaming because it looks like it looks like.

A

A

Can't tell but I think this stream died, crud.

A

No I'm still on its hotel, Wi-Fi, guys, Hotel Wi-Fi, sorry so anyway, I talk to a lot of people and and when I start talking about how we think the brain works and how how we think movement is so important and how we need to build up sensory models, sensory motor models of reality and and and it's a totally different sort of language than everybody else is talking about and and the people I talk to.

A

We're excited about that, because I think a lot of people realize that deep learning isn't going to lead to a gif. In fact, I took a I took a picture. I took a picture that you'll like.

A

A

They had these little, these little things sort of stuck around and so here's here's an example of it. I don't know if you can see that will AGI ever be reached and you've got these stickers and you and you could put the sticker wherever you wanted it to, and everybody said we're getting we're a long way off.

A

Yes, we'll reach AGI, but we're a long way off. So that was encouraging to me, because that means people realize that what we have today is AG is not around the corner and I think the hype cycle prints promotes it like it's around the corner and it's not so all right, that's it. Does anybody have any questions before I sign off? We need to figure out what general intelligence even means. I I have the I have a good idea. What it means in my definition of intelligence is the.

B

Ability to move.

A

Through reality, move through your environment and get given the feedback of sensory information understand how your actions affect the environment.

A

Capsule was was it in the air they're out capsules? One person asked me about capsules. Nobody talked about capsules and the only reason he talked about capsules is because I was talking to him about locations and and sensorimotor and object modeling, and he said. Is that anything like kittens capsules and I'm like yeah and then so.

A

I referred him to our YouTube channel with our current research meetings, where we were talking all about him, his capsules, but but not a topic, not a technical topic in any of the talks that I went to any of the conversation that I heard so.

A

Yeah but yeah hidden is hidden as a genius man, so it's hard to put it any other way, he's certainly really ahead of the curve. Oh so intrepid Fox, all the Monday meeting about capsules cool that was that was pretty interesting. I'm really happy to see us like sort of reviving those things and I heard Jeff say several times. Man I wish we I wish. We would have cited him because he hadn't read those papers and after after, like Marcus, gave a recap of all these papers, Jeff was really impressed. It was like this is.

A

He was like he was happy because it gives us an indication that you know we're on the right track. If, if this is what Hinton has said such a long time ago, it feels like well if we're coming we're coming to the same conclusions independently. That's a that's a good thing right! That's a really good thing that that's an indicator that we're doing the right thing.

A

So I'm happy about that, because our goal is is figuring out, intelligence right, great, well, I'm glad you did and if you haven't been intrepid, Fox go to the forum and register register there. Let me know you send me a PM or something. Let me know you came from twitch or appreciate it.

A

Yeah capsules are all about hierarchy, but there's some things that we can I think take away from it. Even without the hierarchy. Yeah.

A

So all right, any anything else: I'm gonna go get some Pakistani food, while I'm in the city gotta go get something ethnic and I saw one right down the street. That's my plan.

A

Get some rest I got another day tomorrow, but I'm not going to stream tomorrow, because as soon as the conference is over, I'm going to drive home I think well the way the way Hinton puts it is the capsules the same level as a mini column, or at least he makes that comparisons of capsule as like a mini column, I'm, not sure how correct that is.

A

Are you guys, I'm gonna, head out thanks thanks for watching dub dub dub dude I appreciate the time no problem I'm happy to do it as long as I got viewers that are interested, yeah yeah go ask questions on the forum. I'll check them out later take care I'm. Stopping the stream have a wonderful weekend. Monday, oh by the way, Monday we're gonna do active duty cycles. Spacial pooling active duty cycle is it's going to be cool I? Have some good visualizations in store for Monday all right, all right, ticker.