From YouTube: Steve Omohundro on GPT-3
Description
In this research meeting, guest Stephen Omohundro gave a fascinating talk on GPT-3, the new massive OpenAI Natural Language Processing model. He reviewed the network architecture, training process, and results in the context of past work. There was extensive discussion on the implications for NLP and for Machine Intelligence / AGI.
Link to GPT-3 paper: https://arxiv.org/abs/2005.14165
Link to slides from this presentation: https://www.slideshare.net/numenta/openais-gpt-3-language-model-guest-steve-omohundro
A
Okay, so I think we're recording now. Yeah, okay, great. So I'm thrilled today to have one of my favorite scientists on the call, Steve Omohundro. Steve has actually influenced Numenta in a couple of different ways. As most of you at Numenta already know, Steve was my PhD advisor while he was a professor at the University of Illinois Urbana-Champaign and then at UC Berkeley. So that's a Numenta connection.

He was also the thesis advisor for Bartlett Mel, and Bartlett was the first to propose that active dendritic properties in pyramidal neurons could exist and have a computational impact, so that work on active dendrites has certainly impacted us quite a bit in our models. It's kind of hard to characterize Steve and the wide variety of work he's done over the years. He has a PhD in theoretical physics from UC Berkeley; he's been a professor of computer science; he's been an entrepreneur and has founded several companies.

He has created two programming languages that were widely used: StarLisp, the parallel language used on the Connection Machine at Thinking Machines Corp., and Sather, an object-oriented language he developed at Berkeley that I used for many years. He's done quite a bit of work in cryptography, both in research and in industry. Maybe more relevant to today's discussion, Steve has worked in machine learning, computer vision, natural language processing and AI since the late 80s. He's conversant in just about every machine learning algorithm you can imagine, and more recently he's been spending a lot of time thinking and talking about the implications of AI for society and the ethical implications of all that. He's currently the chief scientist and a board member at AIBrain, a company creating new AI technologies for learning, conversation, robotics, simulation and music.

So today's talk actually came out of a discussion I had with Steve a couple of weeks ago. We were talking about GPT-3, and it became clear that the advances going on in natural language processing today are quite startling. There are many interesting questions being raised, and I thought Steve would be the perfect person to lead a discussion on this at Numenta. So thank you so much, Steve, for coming and talking to us today.
B
Oh, thank you — so happy to be here. Yeah, Subutai and I were talking about the implications of GPT-3, and it's hard to wrap your head around what's happening. Just to highlight that: this morning it was announced that Google released GShard, a 600-billion-weight model, and there are rumors that people are working on trillion-weight models. We're in this period where these things are just growing dramatically, and so understanding the implications of that — scientifically for intelligence, and technologically and socially — I think is super, super important. Subutai and I had a great conversation; he suggested, hey, let's broaden it, and so I'm really happy to be here. I'd be happy to just do it interactively: I put together some slides, but we can delve into any part that's especially interesting, so the best way is to, you know, just cut in and make a comment or ask a question.
So that's what Subutai and I were talking about. What I'd like to do today is give you a rough outline of what the model is and what it looks like; what its scaling behavior is and where it seems to be going as these models get bigger and bigger; how it relates to all these word embedding, sentence embedding and document embedding ideas, and to something called distributional semantics — so, how does meaning come into this class of model — and then a characteristic of these models which is very different.
Some people are calling it software 3.0: instead of, you know, writing a program, or building a neural net model and training it, you instead give a prompt to one of these big models, and it figures out from your prompt in English what it is you want it to do. It's kind of a remarkable thing — it's almost like magic. Does this really work? I'll show you some examples where it does. So we'll go through that sequence, and at any point, if you've got questions or comments, jump in.
Remarkably — and in fact it can do translation right out of the box, though it's primarily English — any probability model like this can always be factored as a product of the probability of each word conditioned on the previous words. The simplest form of language model is an n-gram model; in, say, a trigram model you just take the two previous words.
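That factorization, written out from his description (the second line is the n-gram approximation, which keeps only the previous k words as context; k = 2 gives the trigram model):

```latex
P(w_1,\dots,w_n) = \prod_{i=1}^{n} P(w_i \mid w_1,\dots,w_{i-1})
\;\approx\; \prod_{i=1}^{n} P(w_i \mid w_{i-k},\dots,w_{i-1})
```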
You predict the next word based on those, and you can build such models just by recording, from a big data set, how often trigrams occur; then you use that to predict future words. People have been doing that for years, and starting around 2000 these n-gram models started beating all of the very complex linguist-designed models. That was maybe one of the early wake-up calls that these big compute-heavy, data-heavy models might supplant the clever, human-created, detailed models. I visited the Cyc project back in the 90s, where they had teams of linguists putting all the detail about the world into these systems; well, all of that has been supplanted by simple statistics.
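A minimal sketch of the counting approach he describes — tally how often each trigram occurs in a corpus, then read off next-word probabilities. The tiny corpus and names here are illustrative, not from the talk:

```python
from collections import Counter, defaultdict

def train_trigram(tokens):
    """Count how often each word follows each two-word context."""
    counts = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        counts[(a, b)][c] += 1
    return counts

def next_word_probs(counts, a, b):
    """P(next word | previous two words), from the recorded counts."""
    following = counts[(a, b)]
    total = sum(following.values())
    return {w: n / total for w, n in following.items()} if total else {}

corpus = "the man took the dog for a walk and the man took a nap".split()
model = train_trigram(corpus)
print(next_word_probs(model, "the", "man"))   # {'took': 1.0}
```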
From that perspective, you can think of GPT-3 as a 2048-gram model, since it predicts the probabilities of the next word based on a context window of 2048 tokens.
Here's what the architecture of the model looks like. You take the context window of 2048 tokens; each token is mapped into a vector in a real vector space through an embedding matrix. So the first set of weights is in how you map tokens into this vector space, and then everything else is done in vector spaces. The whole pipeline it goes through is that these 2048 vectors get transformed into another layer of 2048 vectors, and another layer of 2048, and another layer of 2048.
This block here is the transformer block, and in GPT-3 there are 96 of these — ninety-six layers doing this transformer thing. Inside the transformer, the key part, the part that makes it magic, is this thing called self-attention, and I'll say what that is in case you haven't seen it; there are also some more traditional feed-forward networks in there. So you get 96 layers of that; then finally, at the end, you have one more feed-forward neural net, and then a softmax.
The softmax converts values to probabilities, and those probabilities are the probabilities of the next token. So, given an input of 2048 tokens, it gives you a probability distribution over what the next token is, and from that basic operation you can do all kinds of stuff; that seems to be the style of the modern model. So what is a transformer? What's this self-attention thing? A convolutional network takes a sequence of input vectors, and the output vectors are weighted combinations of those vectors.
B
Usually
it's
the
nearby
ones
get
combined
in
some
way
to
produce
something.
Those
have
been
very
influential
in
image.
Processing
a
little
bit
in
language
transformers
are
similar,
except
that
each
of
the
input
vectors
goes
through
three
matrices
to
produce
of
you
vector
a
key
vector
and
a
query
vector
and
the
output
vector
is
going
to
be
a
weighted,
linear,
linear
combination
of
the
value
vectors
of
the
input
and
the
way
that
you
decide
on
the
weights.
This
is
the
attention
piece
is
the
vector
you're
interested
in
its
query.
B
Vector
is
dot
product
with
the
key
vector
at
each
of
the
other
vectors
in
the
input
and
those
dot
products
are
then
normalized
and
serve
as
the
weights
to
combine
the
value
vectors.
So
it's
basically
just
a
linear
combination
of
the
value
vectors
but
where
the
weights
are
determined
by
this
self
attention
mechanism,
and
then
these
the
matrices
which
produce
this
value
key
inquiry
vector
those
are
all
learned
in
the
system.
The
whole
thing
is
trained
through
back
propagation
and
to
end
to
achieve
high
probability
of
predicting
the
next
word
on
the
training
set.
So it's sort of the most vanilla autoregressive language model you can imagine, with the one extra twist that it uses this slightly odd attention mechanism in the middle. That's all it is, just scaled up to be really big, and yet it does amazing things. Oh, one other comment: because it's an autoregressive model — it's trying to predict the next word from the previous words — the self-attention is masked, so it only looks at previous words. There are other variants of transformers.
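A minimal single-head NumPy sketch of the masked self-attention just described — queries dotted with keys, normalized into weights, used to mix the value vectors, with future positions masked out. The toy dimensions and the usual scaled softmax are standard transformer conventions, not numbers quoted from the talk:

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """Single-head masked self-attention, per the description above.
    X holds one row per token (embedding plus, in the real model,
    position information). Each row is projected to a query, key and
    value vector; output t is a weighted sum of value vectors at
    positions <= t, with weights from normalized query-key dot products."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # (T, d_head) each
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # scaled dot products
    T = scores.shape[0]
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf  # mask future
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)          # softmax -> attention weights
    return w @ V

rng = np.random.default_rng(0)
T, d, dh = 5, 16, 8                             # toy sizes, not GPT-3's
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, dh)) for _ in range(3))
print(causal_self_attention(X, Wq, Wk, Wv).shape)  # (5, 8)
```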
C
I have a question before you go on. As you talk about this, it's all in the context of language, but this idea of the self-attention mechanism — is that something that you, or the field, view as language-specific, or is it a general capability that would apply to many different domains? Is it a language solution or a general-purpose solution?
B
Well, I think it's especially important for language, because what it gives you is long-distance connections, which language uses a lot. But people are taking this self-attention thing and dropping it, unchanged, into, say, convolutional networks for image recognition, and it improves performance. And then OpenAI recently did almost exactly the same model for predicting images: you give it the top half of an image and it predicts the bottom half, going pixel by pixel, just filling in the bottom half pixel by pixel — and it's remarkable the kinds of things it has learned, very high-level, semantic-seeming things. So it seems to be a kind of very general intelligence element that can be applied in all sorts of domains. There are several different alternatives for self-attention that people have proposed; this one is very efficient to run on GPUs, so I'm not at all convinced it's the optimal way to do it, but it seems to work, and there are starting to be papers now exploring the space of possible self-attention modules. All right — so these models seem to do better as they get bigger, and since they were introduced in 2017 or so, they've been growing exponentially, and we see that trend continuing.
As of this morning's announcement from Google, basically every big AI lab has been building variants of this kind of model and applying them to a number of different tasks. For the GPT-3 architecture, they actually built 8 different versions with different sizes so they could see how the scaling was going; the biggest version is 175 billion weights. The pipeline starts with a reversible encoding of the text — not quite what you would ultimately like to do, but at the moment they do it this way. With a context window of 2048 tokens, they have 96 transformer layers.
Each layer has that attention module like I mentioned, but they actually run 96 attention heads in parallel, so that different attention heads can discover different phenomena. For example, one attention head might look for, you know, the direct object of a verb; another one might look for the indirect object. Somehow the system, through backprop, is going to figure out what to do with all of these heads.
There's a lot of work these days on questions like: do you really need 96? Does it help to have more? Do different layers need different numbers of heads? All those questions are kind of unknown at the moment, I would say. Each of these heads produces a 128-dimensional vector, and they're all concatenated to form the vector for the next layer.
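The arithmetic of that concatenation, as a sketch (96 and 128 are the GPT-3 figures he quotes; the per-head outputs here are placeholders):

```python
import numpy as np

n_heads, d_head = 96, 128                 # the GPT-3 figures quoted above
head_outputs = [np.zeros(d_head) for _ in range(n_heads)]  # placeholders

# Concatenating the 96 per-head vectors yields one 96 * 128 = 12288-
# dimensional vector per token, which feeds the next layer.
print(np.concatenate(head_outputs).shape)  # (12288,)
```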
So those are the choices there are in making this thing; it's quite generic, really. The training data is 499 billion tokens. Some people estimated the cost of training it at 12 million dollars, though somebody figured out that on the cheapest GPU cloud you could do it for four million six hundred thousand dollars — but it would take 355 years.
B
Yeah — well, in fact, if you read the paper, one of the worries is that some of the tests they evaluated on appear on the web and so might have ended up in their training data inadvertently. They accidentally did have some leakage like that, but it was too expensive to go back and retrain, so they tried to account for it in their reporting.
You know, OpenAI said: we're not a nonprofit anymore, we're going to be a for-profit company — and Microsoft invested a billion dollars in them. It turns out Microsoft built the supercomputer they used to train this thing, and I'm guessing there's some sort of internal accounting where Microsoft, you know, donated a billion dollars, but then some of that comes back to them for running this computer. I'm also guessing that the Microsoft data centers will be among the first to use this type of model, so there are agreements of some kind going on back and forth. They also announced recently that they have an API to this model, which they're going to start licensing out for end users. So, some kind of complex business-and-research model... okay.
B
Yeah — and it's an interesting thing: academic research labs probably don't have access to this kind of compute power, so if the future of AI models requires this kind of compute, that could be an issue. You know, Subutai, earlier we were talking about the trade-offs between China and the US in AI development; it may be that just raw compute power is a really critical component. So here's their graph from a paper about a year or so ago.
They analyzed all of the transformer models and found very clean scaling relationships, the most important of which is the loss of a model versus the compute power, when you optimize everything else — the amount of training data, the size of the model and all that. They found this really simple, nice relationship, and these curves show different sized GPT models: this is their smallest one, and this is the new one, the really big one.
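Assuming the paper he's pointing at is the OpenAI scaling-laws result (Kaplan et al., 2020), the relationship is a power law in compute C; the exponent below is approximate:

```latex
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}, \qquad \alpha_C \approx 0.05
```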
B
Yeah — it's the predictive loss on the training data. In fact, a better, more intuitive quantity is perplexity, and I'll clarify: perplexity is, you know, your uncertainty about the next word. You've just seen a bunch of words in a piece of text and you're trying to predict the next word; if your uncertainty about the next word is like throwing a k-sided die, then k is the perplexity. It's 2 to the entropy of the distribution over the next word. If you just use unigram statistics, perplexity is about 962 — so if you just guess what a word is going to be, it's about 1 in 962. Bigrams give you 170; trigrams, 109. GPT-2 broke records in perplexity about a year or so ago at 35.8, and GPT-3 now sets a new record: twenty point five. It's not exactly clear what human perplexity is.
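Perplexity as he defines it — 2 to the entropy of the next-word distribution, so a model exactly as uncertain as a fair k-sided die has perplexity k:

```latex
\mathrm{PP} = 2^{H}, \qquad
H = -\frac{1}{N}\sum_{i=1}^{N} \log_2 P(w_i \mid w_1,\dots,w_{i-1})
```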
People have written, you know, elegant essays on that topic. With GPT-3 they generate some essays, take human-written essays, and then ask a bunch of humans to try to figure out which ones were computer-generated and which were human-written. So it's kind of a variant of the Turing test where there's no interaction — the system just generated some text. They've been finding that the ability of humans to tell what was fake and what was real has been steadily decreasing, and with GPT-3 it's essentially at 50/50.
Now it's like fifty-two percent that they can detect the machine, so we're essentially at the point where these machines can generate text that you can't distinguish from human-generated text.
Rich Sutton wrote an essay he called The Bitter Lesson, and his bitter lesson was that simple AI that leverages compute power will eventually beat out clever AI built using human knowledge. He argued that that was true with chess: people had all these clever, complicated chess-playing programs, and then Deep Blue basically just did a fairly brute-force search and beat them. More recently we've had AlphaGo beat Go using, you know, self-play with some simple learning and some search. And these language models are really showing the same thing — first with n-grams and now especially with these transformer models, which seem to just do better and better and better as they get bigger. And there are a bunch of other recent experiments on different kinds of AI tasks where it looks like scaling with compute power is just, you know, beating everything. So that's sort of a tragic conclusion for those of us who really love to deeply understand things, but...
...maybe it's the reality, and I think OpenAI has taken that to heart. Their chief scientist was recently quoted saying: give GPT the compute, give it the data, and it will do amazing things. So I'm guessing their business model involves just scaling things up, and it sounds like Google is doing that too.

There's another way to look at these models to try to understand what they're doing, which is word embeddings. Take the word "bank": it could be a bank where you put your money, or it could be a river bank, the side of a river — so knowing the context you're in makes a big difference. After the initial things like word2vec, GloVe and fastText, which were static word embeddings, a whole bunch of models came out that used the context of a word: they still encoded a word as a vector, but within a given context. And more recently they've been encoding sentences as vectors, documents as vectors, books as vectors...
F
I have a question for you. When you talk about these as vectors, encoding the context and all that, is there any sense in these models of temporal structure — any difference between the last word and the word before that and the word before that — or is it all kind of collapsed into a spatial representation?
B
Yeah, that's a really critical question, and they've gone all over the place. A lot of the early models used what they call bag of words: they treat a sentence just as whatever words are in it and completely ignore the order. The self-attention operation by itself is permutation invariant — it doesn't know anything about the ordering of the words — but what they've done is feed position information in along with each token, because, you know, "the man hit the dog" is very different from "the dog hit the man". So these more modern models really are using word order, but it comes in in a sort of implicit way that's maybe a little different than you might have thought. Here's the early shock of the static word models like word2vec and GloVe: not only did they map similar-meaning words to similar locations, they had these equations, and the most famous one was king minus man plus woman equals queen.
So if you took the vector for the word king, subtracted from it the vector for the word man, and added the vector for the word woman, you would get the vector for the word queen. And in fact there are a whole bunch of pairs with a masculine/feminine relationship — oh, I don't know, brother/sister, grandfather/grandmother — that are all related by vectors which are quite similar. And a whole bunch of relationships, like between a country and its capital — France and Paris, Italy and Rome — give vectors that are all about the same; likewise an element and its symbol, or a company and its product. So a whole lot of fairly coarse relational information also gets encoded in these models, which was sort of the first hint that maybe some semantics really is getting mapped into these things.
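A toy sketch of that vector arithmetic. The 3-dimensional embeddings below are made up for illustration (real word2vec/GloVe vectors have hundreds of dimensions), but the nearest-by-cosine lookup is the standard way these analogies are evaluated:

```python
import numpy as np

# Made-up 3-d vectors standing in for real word2vec/GloVe embeddings.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "paris": np.array([0.1, 0.9, 0.5]),
}

def nearest(v, exclude=()):
    """Vocabulary word whose embedding is closest to v by cosine."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(emb[w], v))

v = emb["king"] - emb["man"] + emb["woman"]
print(nearest(v, exclude={"king", "man", "woman"}))   # queen
```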
B
Well, this chart is not from GPT-3 — I think it is from word2vec, which is that coarse, static thing. But even GPT-3 is trained on web data, and that's one of the challenges in terms of social implications, if these models start getting into very important, central things.
So, getting to your point about temporal information — and language is all about much more complex structures at the next level — it turns out that for these BERT-like models, if you look at the embeddings they create, they create trees in the vector space that correspond to the linguistic parse tree. So even though nothing from classical linguistics is built into these models, it turns out they actually rediscover a number of things. There's a whole field now...
...that people are calling BERTology. BERT was sort of the first and most prominent of these models, and people have been analyzing what these networks are learning — looking in particular, as you go down through the layers, at what kind of information the vectors in the deeper and deeper layers are encoding.
What this paper discovered is that the stages of a classical natural language processing pipeline, which used to be human-built, are sort of showing up in the successive, deeper layers of BERT. Early layers discover the parts of speech, what constituents there are, what the dependencies between things are, what the entities in a sentence are; then the semantic role labeling of entities, coreference between pronouns and nouns, semantic proto-roles, and then relation classification. It's kind of remarkable that an end-to-end, backprop-trained model appears to be rediscovering some of what classical linguistics has tried to figure out. And I think this speaks to something really fundamental in linguistics, their notion of semantics: I would say the dominant notion is Montague semantics, which tries to map structures in language into some form of logical formalism — they used a typed lambda calculus, which is what Montague was promoting.
What's going on now is sometimes called distributional semantics, and one of the early advocates was this gentleman John Firth, whose most famous phrase was: "You shall know a word by the company it keeps." His idea is that the semantics of a word is the probability distribution over the words of the contexts in which it can appear. So you can figure out that dog and cat are similar because dog and cat appear in similar sentences — you know, "the man took the dog for a walk."
I'll talk a little more about this later, but I'll say it now: I used to have a great thought experiment, which was, let's say you're a really good pattern recognizer, a statistical learner, and you just watch TV all day. Can you learn about the world from that? It seemed to me that you could: basically, you'd very quickly discover the notion of frames — you know, TV frames — and what pixels are near one another in the picture; then pretty soon you'd realize that, oh, there are blobs of color that tend to move together, and you'd discover objects; then you'd probably build 3D models of those too, as the best explanations for objects rotating; and then you'd maybe discover the laws of physics.
You'd hear these sounds and connect them to what you see. Some people argued that with text alone there'd be nothing you could do; this, I think, is arguing the opposite. Maybe I'll skip ahead to a slide showing the kinds of real-world semantics that are implicit in GPT — what it can discover, what it knows. After being trained just on web text, you can probe it by asking questions. Let's say: does it know who the president of the United States is?
You could say "The President of the United States is ___" and it's supposed to fill that in, and it would say the highest-probability word is Trump, or so on. You can use that kind of statement to extract the knowledge that's sort of in these systems, and using that, people have found that it knows all the US presidents and Russian leaders in temporal order, and it knows the latitude and longitude of cities in the United States and Europe and their relative distances.
It knows the relative sizes of many objects, like cars, elephants, humans and houses — you can test that with phrases like "a human is bigger than an elephant" versus "an elephant is bigger than a human" and ask which has higher probability. It knows what animals are dangerous, what objects are dangerous, how smart different animals are, what clothing is appropriate for different age groups, emotional states, costs and weather conditions, the qualities of mythological creatures, physical properties of objects like rigidity, strength and transparency, and whole-part relations.
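A hedged sketch of that probing technique. GPT-3 itself is only reachable through OpenAI's API, so this uses the publicly available GPT-2 through Hugging Face transformers to compare which of two phrasings the model finds more probable:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def log_likelihood(text):
    """Approximate total log-probability the model assigns to a sentence
    (mean token loss times length; fine for comparing two phrasings)."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return -loss.item() * ids.shape[1]

a = "An elephant is bigger than a human."
b = "A human is bigger than an elephant."
print(a if log_likelihood(a) > log_likelihood(b) else b)
```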
It knows that the hand is part of an arm and an arm is part of the body; countries and cities, their capitals, their gross domestic products, their internet usage — all that kind of stuff. Some of this information is explicitly in the training set, but some of it sort of emerges from it; somehow these models seem to be capturing this kind of semantic information. And once you start seeing that, you can imagine that a smart learner just listening to the radio could figure out what the objects are, what the categories of those objects are, which objects are most similar to one another, what the actions tend to be, what simplified intuitive laws of physics might be, what an intuitive sort of psychology might be. You could probably build up a pretty good model of the world just from sufficient amounts of text. So that's my current stance, and I'm happy to hear arguments against it if anybody disagrees.
B
I think certainly interacting is very helpful — you can learn things much, much more quickly. So here's my little chart on exactly that. Biological organisms interact with the world, and that lets them probe aspects they don't know well: they try something, and I think they're very driven by when mistakes are made. You push on something, you predict something's going to happen, and something else happens — then you're really interested.
You start, you know, playing around with it, and so I think it's very helpful for building models of the world. One step removed from that is a simulator, like a video game of the world, and that's probably as good. I now think you don't really need that interaction — it speeds things up, but if you have a sufficient amount of raw video and you just build a good statistical model of it, you can build up those kinds of things.
So related to that is this other piece, which some people are calling the GPT-3 type of interaction. The old-style neural nets were: you design a neural net for a specific task. Say you want to do sentiment analysis — you want to look at movie reviews on Netflix and decide, is this a positive review or a negative review? The way you used to do that is you take a bunch of reviews and have humans label them: yeah, this one's positive, this one's negative.
You build a special-purpose neural net and train it on that task. Then they got the idea — ULMFiT, I think, was the thing that shifted people — of a transfer learning view, where you train a big, big model on, say, an unsupervised learning task, and then, once you've got a good model of language, you put a little teeny layer on top that's specific to the particular task...
...you care about, say sentiment, and then you train just that extra little bit. That has been the paradigm over the last few years. So the first phase, maybe, would be the old days, software 1.0, where you design an algorithm to do something; software 2.0 was you design a neural net and train it to do something; and the new paradigm — software 3.0 — is you give a prompt to one of these big pre-trained models.
The weights of the model are fixed once it's trained. On the other hand, as you run it, it's doing this attention thing, so the attention weights are changing all the time, and some people think of those attention weights as a kind of fast weight — a weight that's set dynamically depending on what the input is. In that way you might think of it as doing a kind of neural training, but during inference. In their tests they did three kinds of evaluation, which they called zero-shot, one-shot and few-shot.
Let's say they want to translate from English to French. Notice it was never taught translation; it's just that there's some webpage out there that happens to say, oh, the translation of this word in French is this in English, and that's all it's using to figure out what translation means. The zero-shot version is: you just say "Translate English to French:", then "cheese" and this little arrow symbol, and you hope it figures out to say "fromage". And for many of these tasks, remarkably often it does.
You can give it a little more context for what you mean by giving it one example: you say "Translate English to French: sea otter => loutre de mer; cheese => ?" That helps it a little bit. Or you can give it what they call few-shot examples: they put in enough examples to fill up the context window, which is 2048 tokens, and they say that's typically somewhere between 10 and 100 examples.
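A sketch of the prompt formats he's describing; the arrow separator and the sea otter pair mirror the translation figures in the GPT-3 paper, while the helper function itself is just illustrative:

```python
def few_shot_prompt(task, pairs, query):
    """Build the zero/one/few-shot prompt format from the paper:
    a task description, zero or more solved examples, then the query."""
    lines = [task]
    lines += [f"{src} => {tgt}" for src, tgt in pairs]
    lines.append(f"{query} =>")   # the model continues from here
    return "\n".join(lines)

# One-shot version of the talk's translation example:
print(few_shot_prompt("Translate English to French:",
                      [("sea otter", "loutre de mer")], "cheese"))
# Translate English to French:
# sea otter => loutre de mer
# cheese =>
```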
It can do quite a credible job given just one example, and as you give it more examples it does better and better. So it's a very weird way of programming — and yet that's the framing in which they do everything here. They evaluate on a bunch of different tasks, something like 25 or so, and this is roughly how it does overall on zero-shot, one-shot and few-shot.
So having more examples in the input definitely helps, and having bigger models helps a lot. There are 42 different benchmarks that they test on, and many of them are standard, really hard benchmarks. SuperGLUE is a standard natural language processing benchmark which requires all kinds of, you know, figuring out which sentences follow from other sentences and which things entail other things. These are not trivial tasks.
B
That's a good question — I'm not really sure. For something like this one: once a human gets that, oh, they're putting asterisks in the middle of the word and you're supposed to remove the asterisks, the human would very quickly go to 100%. So that shows these things are not operating the same way humans are; they're not really figuring out...
B
You
know
the
the
abstract
notion
of
what
the
intention
is
here,
they're
doing
something
in
between,
and
in
fact
my
final
slide
here
is
talking
about
Kennan's
Thinking,
Fast
and
Slow
type,
1
versus
type
2
thinking.
So
so
in
human
thinking.
There
seems
to
be
at
least
two
forms
of
thinking,
one
which
is
unconscious
and
rapid,
which
they
call
type
1
and
1,
which
is
deliberative,
conscious
and
involves
reasoning
which
they
call
type
2
seems
to
me
and
I.
...think other people are starting to think this too — that deep learning in general, and these kinds of models specifically, do a really good job of type 1: very rapid, but not multi-step. For many of these tasks you would do a lot better if you had real multi-step reasoning, and so I think where AI is going is to take this kind of immediate model and combine it with planning-and-reasoning types of systems.

I don't think it's really planning those essays, in the sense of sequentially considering different things it might end up on. I think it has built in some kind of high-level semantic knowledge; it figures out what semantics it wants the essay to have and then sort of lets it play out. It's like when people speak: most speech is also not planned — I'm not planning out what I'm going to say in the next sentence; I'm letting it emerge from a structure that's there. Whereas a really good essayist will figure out: oh, I want to have this emotional impact, and to do that I need to go here. So yeah, I think that's a good point. You'll also see that some of the training examples have a sequential character to them, so it's an interesting question. Oh — in particular, this one.
So one of the controversial things — they were arguing about this with GPT-2 — is that it can do certain forms of arithmetic. If you ask GPT what 22 plus 33 is, it'll give you an answer; sometimes it gets the answer right and sometimes not. People were hypothesizing that basically, if it saw a particular problem somewhere on the internet, it would remember it and give you the answer for that.
If it hadn't seen that problem, it would generalize: oh, they're showing me some numbers, and I know what a number is, so I want to generate a number of the right form — but it wouldn't really know what addition was. What's remarkable is that they tested this on two-digit addition and subtraction, three-digit, four-digit and five-digit. Even the big models aren't quite doing it completely right, so they're not really learning the full addition algorithm.
F
On this one — I mean, it's striking how you go from thirteen billion parameters to 175 billion in order to get reasonable results. What is the thinking in this community? The scale of resources required for it to actually solve any real-world problem seems, you know, way out there to me. Is the view that Moore's law will bail us out?
B
I think that's really a central question. We barely have any understanding of what the representation of operations inside these models is, and I don't think anybody really knows what their computational capacity is either. I started looking at the self-attention operation, and I believe that if you handcrafted it, it would be sufficient — I think it's computationally universal if you use it in the right way. But whether backprop learning through self-attention can learn, say, a real addition algorithm...
...it's not even clear. Somebody took GPT-2 and trained it on chess games, and they had it playing chess — not very good chess, but it could play something of a game of chess. So it's weird: these strange models are sort of halfway between general-purpose neural nets and things with something of a computational element in them. I'm guessing these are very early days, and there are going to be new variants of these that will be much more applicable, especially to this kind of task.
B
Oh yeah, I definitely think it's generalizing; it's not just memorization. Especially when you get up to five-digit addition: very few of the possible five-digit addition problems could actually be out there on the internet. But it may be doing it in a fairly simple way — like doing the two-digit addition on the first two digits — combining knowledge it has about the pieces in some way. In some sense, somebody should really nail this down.
H
Imagine we do some grounding — map language representations to features in the visual domain. If we can do that, we could maybe even reason about the visual domain in the space of words, and figure out things in the visual domain just by reasoning about words, which should be similar to what we humans do, right? Like, we know a horse has four legs and a tail, so if I can identify four legs and a tail, and I know how to reason...
B
We have a little reading group, and we just read papers on combining language with reinforcement learning — often there's language involved in tasks where you're trying to plan certain activities. But yeah, a lot of people are looking at this. You know, in some sense image processing and video processing moved ahead of natural language for a number of years there, and so...
...you may be able to actually get an alignment between language and vision — ground the language in physical reality — without ever having trained them together. I don't know of anybody having done that, and clearly I think it would be better to train them together, so I expect that a version or two down the line — and probably Google is doing this — they'll just run every YouTube video through it, where you have both language and video, and build models that have both vision and language in them.
At
the
same
time,
clearly,
I
think
that's
that's
the
next
step
and
whether
the
transformer
thing
and
self
attention
is
sufficient.
You
know
the
the
the
image
transformer
that
openly
I,
just
just
released
like
two
weeks
ago,
is
really
interesting
that
regard
because
they
they
use
exactly
the
same
model
for
the
in
the
image
domain
and
it
seems
to
be
capturing
visual
data
pretty
well.
So
maybe
that
is
a
sort
of
Universal
learning.
B
Where I think it's really going to help is this: the vision models are doing pretty well when you have known classes of objects, but there's a lot more, deeper semantics in natural language right now. If you could tie in all of that knowledge from natural language — like, it's not just "a person is smiling" but "a person is joyous because they were just given a gift" — that kind of thing is hard to get at in a pure visual way.
I think we're going to get real richness of semantics on both ends. So, word scrambles — another simple little thing: they take words and scramble the letters, and it's supposed to unscramble them; a similar kind of phenomenon. They have a whole bunch of examples like that. So, because this thing seems to be showing characteristics that were unexpected, there's a lot of controversy, online and in discussions, about how powerful this thing really is. Is it really just learning things?
Is it just, you know, scrambling up the internet, remembering it and spewing it back out, or is it doing something more? Opposing factions are forming. One blogger, who goes by the name Gwern, did a lot of experiments with GPT-2 and has recently been doing a bunch with GPT-3.
He's been arguing that this kind of scaling could get somewhere dramatic in a matter of years rather than a long time. On the opposite side, here's a well-known linguist, Emily Bender, who does a lot of work on linguistic semantics; she just came out with a paper arguing that this kind of model is, in principle, not able to actually capture real meaning. A typical sentence from the paper: "We argue that the language modeling task, because it only uses form as training data, cannot in principle lead to learning of meaning."
So far nobody's really been able to do that with text, that I'm aware of. And yet with GPT-3... So he did a bunch of experiments. He said: summarize the Harry Potter story — which I guess it has read, or knows about, from online — in the style of different authors. Here's Harry Potter in the style of Ernest Hemingway. He started it off — the bolded text is what he gave it, and the rest is what it generated: "It was a cold day on Privet Drive."
B
If
a
child
cried
Harry
felt
nothing.
He
was
drier
than
dust.
He
had
been
silent
too
long.
He
had
not
felt
loved.
He
had
scarcely
felt
hate,
yet
the
Dementors
kiss
killed
nothing
death
didn't
leave
him
less
dead
than
he
had
been
a
second
before
and
because
I
was
pretty
good
and
then
he
asked
him
to
do
the
same
thing
summarize
Harry
Potter
in
the
style
of
Jane,
Austen
and
it
generated,
is
the
truth,
universally
acknowledged
that
a
broken
Harry
is
in
want
of
a
book
this.
A
In this case — I mean, mechanistically, it's given the bolded text as a sequence of these tokens, and then it's just asked to predict the next token; then you include that next token and predict the next one, and so on.
B
You know, when you generate text from these models, there are various ways to do it. The model gives you a probability distribution over the next word. If you always take the highest-probability word, that can sometimes generate aberrant things — it can generate sentences that cycle. So they often do what they call a beam search, where maybe you take the ten highest-probability words and track them along a little, and you sort of find the highest-probability sequence. I'm not exactly sure whether they're doing anything like that here.
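A sketch contrasting the two decoding strategies he mentions — greedy argmax, which can loop, versus a small beam search that tracks the k most probable continuations. The `step` function standing in for the language model's forward pass is hypothetical:

```python
import numpy as np

def greedy(step, prompt, n):
    """Always take the argmax token; prone to repetitive loops."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(int(np.argmax(step(seq))))
    return seq

def beam_search(step, prompt, n, k=10):
    """Track the k highest log-probability continuations at each step."""
    beams = [(0.0, list(prompt))]
    for _ in range(n):
        candidates = []
        for logp_so_far, seq in beams:
            logp = np.log(step(seq))
            for tok in np.argsort(logp)[-k:]:
                candidates.append((logp_so_far + logp[tok], seq + [int(tok)]))
        beams = sorted(candidates, key=lambda c: c[0])[-k:]
    return max(beams, key=lambda c: c[0])[1]

# `step(seq)` stands in for one forward pass of the language model,
# returning P(next token | seq); uniform placeholder here (hypothetical).
step = lambda seq: np.ones(50) / 50
print(greedy(step, [1, 2], n=3), beam_search(step, [1, 2], n=3))
```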
G
Well, a variation on that is to look at, say, the two highest probabilities, and if they're nearly co-equal in magnitude, note that there's an ambiguity — and maybe that would be enough to say: here are two possibilities; which one do you mean? You can think of it as incorporating an exploration aspect, going out to ask a human and basically saying: okay, this is ambiguous to me.
B
So I think there are a gazillion ways of using this kind of tool while including human input, you know, in actually doing something useful. Maybe you don't want it generating news stories autonomously; you could have it generate the story, but it's probably better, like you say, to have a human in there curating and guiding it in some way. And the best way to interact with these things in terms of products...
...I think is totally unclear at this point, so I think there's a whole new kind of technology that could emerge from this. I'll give you just one more example, because it's super amusing: there are a bunch of critics of this. Doug Hofstadter — actually an old friend of mine, who wrote Gödel, Escher, Bach — is a real champion of understanding deeply how language and the mind work, and I think he rails against these kinds of statistical models; he really hates them. And Gary...
...Marcus is somebody else who's written a whole bunch of things criticizing these deep learning models as basically missing some of the essence. I don't know if they wrote this together or whether it was just something the person testing it did, but someone started off an essay about how AIs could never be creative and never have good language, and then asked GPT-3 to continue it. The continuation was: "They will never be able to truly understand the underlying meaning of things."
"Additionally, we still don't have programs that can truly imagine or create new things all by themselves in a natural way, but rather just brute-force through random strings of garbage until they happen to hit upon the correct solution to the current problem. We still lack a Rosetta Stone for the universe and don't really know what things like light, sound, matter, gravity, life, free will and consciousness are." That was generated.
F
A question: we have a natural language partner who has done a lot of work on this using sparse representations, and that's a lot of what we're doing in our own work now, with sparsity. Have these guys looked at all at sparse representations as part of what they're doing — to get better context or better performance or anything that you know of?
B
Well — no, I don't know of much work on sparsity in GPT-3. Inference on GPT is very rapid — a few milliseconds kind of thing — so I'm not quite sure. Also, the GPUs keep getting better; Nvidia keeps cranking out more and more powerful GPUs. And one of the reasons people are so excited about the transformer architecture is that it maps pretty well onto a GPU — part of the reason they chose this attention mechanism is that you can run it efficiently there.
It seems to be much more flexible than that. If you go to this website here — this is Gwern — he did a bunch of weird experiments; in particular, he asked it to generate puns, and it creates its own words. He does a lot of that. Who knows exactly how it's doing it; what it's doing is, you know, funny — it's sort of like interacting with something different that we haven't seen before.
A
There are a few examples, even in their paper, where they define a nonsense word right in the sequence and then ask it to generate a sentence using it. They have an example like: "A Gigamuru is a type of Japanese musical instrument. An example of a sentence that uses the word Gigamuru is:" — and then it has to fill in, and it fills in something like: "I have a Gigamuru that my uncle gave me as a gift. I love to play it at home."
H
You write a few sentences with the word and you ask for the dictionary definition, and it will give a dictionary definition that makes sense; or you can go the other way, like you're saying — you give the definition and then ask it to make up a sentence, and it does. It works both ways.
B
Nobody would have guessed that this kind of system could do that, and so there's still a big gap in my own understanding — how on earth is it doing that? I'm hoping people will start probing this: take some tasks like that and figure out exactly what kind of knowledge it's getting at each layer, and tease apart how it works.
H
One thing I can see in the near future: if you ground these language models in images, you can probably create movies or games with these generative models. You just write the text, and you generate the imagery from the text, right? It seems like a very good application — the game industry is huge, the movie industry is huge — and it could save a lot of cost. Yeah.
B
I mean, NVIDIA has been doing some remarkable stuff where you draw a little sketch of something and it makes a photorealistic image of that thing. And then, like you're saying, I've seen a little of people taking text descriptions and generating images and videos from them. What does that mean — that future movies will be created on the fly? Maybe you give it a topic and it generates a movie for you. Remarkable.
B
So BERT is a different architecture. GPT-3 is an autoregressive language model: its basic operation is to give probabilities for the next word, and it's trained to predict the next word. BERT is better thought of as a denoising autoencoder: the way they train it is they give it sentences and block out 15% of the words; it has to compress the input into a representation and then generate the original sentence. That's its training model.
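A minimal sketch of that masking objective. Real BERT also sometimes substitutes random tokens or leaves the picked tokens unchanged, and it works on subword pieces, so this is only the idea in outline:

```python
import random

def mask_for_bert(tokens, rate=0.15, mask="[MASK]"):
    """Hide roughly 15% of the tokens; the model must reconstruct them
    (the denoising objective described above). Simplified relative to
    the real BERT recipe."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < rate:
            targets[i] = tok          # what the model should predict
            masked.append(mask)
        else:
            masked.append(tok)
    return masked, targets

print(mask_for_bert("the man took the dog for a walk".split()))
```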
It's a somewhat different framing, and one advantage of BERT is that it can use the context both before and after a word, which I think gives it more power in terms of learning. But it's harder to train, and most uses of BERT that I've seen have all been: you train the language model, and then you have to train it again on the specific task. I haven't seen people doing the kind of thing they're doing with GPT-3.
Maybe they are and I'm just not aware of it. My sense is that the BERT architecture in general should be more efficient in its use of data, but it's more complicated — the BERT training is pretty tricky — whereas GPT-3 is sort of the simplest thing you can imagine, in a way, and it seems to be working well. So I'm not sure you get a lot of benefit from the BERT approach, but I'm sure both lines are going to keep moving forward.
B
Interesting — because Chinese, I think, is different; I've heard there are things you can say in Chinese that don't translate well into English. I used to like to look at the Dao De Jing and its, like, 80 translations — you take multiple translations and try to see the corresponding text...
B
One direction is using reinforcement learning to improve language generation and tie it more closely to human needs. One area is chatbots: chatbots are still pretty bad. Google came out with something called Meena, which tried to use some of these transformer ideas to make a better chatbot, and I've heard it's pretty good — but I think it's still not really something you'd want to talk to for very long. They tend to lose context; I think after about 15 interactions they kind of forget what you talked about before.
They don't have a very good model of the user; they're not very empathetic; they don't really know what you know and don't know. All of that, I think, reinforcement learning could help a lot with, so I would love to see a much better integration of reinforcement learning — really planning how to interact with a person or with another system — together with the rich kind of semantic knowledge these models seem to have.
A
So I keep trying to think about what the limits of this kind of paradigm are. You could argue that maybe GPT-3 has been trained on pretty much all of the text that's on the web, or close to it. Is that sufficient to, you know, pass the Turing test? Is it sufficient to create something that can generalize and really seem intelligent? Not that the Turing test is a good test of intelligence, but, you know, to the extent that you can interact with something in text.
F
Where my head was going with that was novelty and inference: can it really infer things it's not been trained on? You know, one of the things Jeff talks about a lot is staplers. Okay, you learn how a stapler works — maybe it knows how a stapler works. Does it know that you shouldn't staple your hand, put a staple into your hand? How is it going to know that? You, as a human, can infer that; I would expect it wouldn't be able to make that inference.
B
That's an interesting point. My own sense is that multistage reasoning is its weakness: it does the kinds of reasoning it finds directly, and then slight generalizations of those. Like the staple in your hand — I could imagine it would get the idea that sticking a pointy thing through your body part is painful and that's not good, and that a stapler might be like that. So maybe it could do that; but certainly there are other kinds of problems that require two or three steps of inference...
...you know, take something like: it's bad to stab yourself; that causes damage to your tissue; staples are pointy; this is the way a stapler works. You might be able to find a chain of statements that leads to the conclusion, but I'm guessing it probably wouldn't do it on its own in its current state. Combining it with something that has a bit more reasoning and planning — a multi-step thing — might get you to that stage.
B
There's an OpenAI website where you can apply for access to their API, and I think they're eventually intending to sell or rent pay-per-use access. I've heard a little that Microsoft and OpenAI are working together, so I'm guessing Azure will get a version of this, and it sounds like Google's going to do it too. So I think in a year or two there are going to be a bunch of models like this out there. You know, the code for this is pretty generic.
H
To go back to the other question: it seems to me that if you combine this with a reward-based model, you can have this generative model generate an action for the next time step, and your reward-based model learns a policy, which is just the sequence of actions; then you can do planning over longer sequences of time based on this reward. So you could actually use it as a component — as the model of the world in the reasoning process, I think.
B
That's very interesting. In some sense, I think where reinforcement learning is going is that it needs to deal with hierarchy — hierarchical plans, dealing with sub-plans — and language sort of has all of that in it, so it really seems like they could benefit one another. I would love to see that. One way of thinking of GPT-3 is as a black box: you can give it phrases and it'll tell you the probabilities of extensions, and you can compare phrases to see which is more likely.
C
It gets back to the whole sensorimotor question from before. General intelligence in humans is very much a sensorimotor problem. You know, just set the table for dinner, right? It's a pretty simple task, and it's a sensorimotor problem: where do you put the plate? You do this; where do you do that? What's on the table? Do you have to clean it off? There are a billion things...
...involved in doing something very simple like that. So much of what we do in the world is essentially a novel problem, and much of it is dealing with things that are statistically novel and not represented anywhere. The current arrangement of plates on my dinner table — where the potatoes are versus the green beans versus the pizza, whatever — changes tonight, and I have to build a model of it very rapidly when I walk into the room; I have to update my model when I move the plates around on the table.
So these are not things I will find statistically described anywhere, and they're very interactive. There's a very, very fast temporal modification of the models we have of the world that is not described anywhere; we have to experience things and learn them, and we have to learn them through sensorimotor interaction. I can't learn them through language — I have to learn them by walking into the room, seeing things, picking things up, and so on. And the tasks that we need for general intelligence...
...are also very sensorimotor related. So when we look at these statistical models of the world — and that's what the models you're talking about really are, statistical models of the world; we can argue how good they are, but they're statistical models of the world, and I don't think there's any debate about that.
They have very real shortcomings when it comes to the real-world behavior of humans or intelligent agents. The tasks we've seen them applied to — which are really impressive; I agree, they're very impressive results — are only the tasks that can be done with statistical models of the world. In some sense, they don't apply to things that require tremendous flexibility and very rapid learning about the world.
B
Yeah — great, great points. Though, to push back a little: imagine, from a natural language description of tables and plates and how you can push forks around, that you built a little physics simulator, with a simulated dining table and simulated forks. You'd have to be motivated to do things in that world, but you could then operate in that simulated world and learn the kinds of actions you might take.
C
That's the question. I've always believed that we don't learn about the world through language: we learn certain things through language, but most of what we know about the world we learn through observation of other sorts — auditory, tactile, visual. I have an example. I use my bicycle every day, and it makes sounds, right? It makes various sounds, and I know...
...these sounds. I know the sound it makes when I click down the kickstand; I know the sound it makes when I'm turning here; I know the sound the chain makes. All these little sounds — I know about them, and they're in my model of the bicycle. I don't have words to describe those sounds; I don't even have words for the things making the sounds. I kind of know that, yeah, this thing does this — but I may never know what the word for it is.
C
Maybe someone else does, I don't know. But I don't learn about the world through language; I learn some things that way, and obviously we communicate a lot of knowledge that way. But the point is, and I think this could be critical for these models, that you can only capture some parts of the world through language. There is knowledge that we individually have that is not expressed in language. So that's a sort of limit to the modality.
C
It's not necessarily an inherent limit of how these systems work in general; that's why I asked you earlier how these models apply to other things. But from my perspective, if we really want to create artificial general intelligence, we have to have these systems applied to sensorimotor systems: systems with actual input senses, with arrangements of sensors, you know, visual inputs. These systems have to be active agents in the world, and they also have to learn very, very rapidly.
C
It cannot be based on statistics over lots of data. You can only learn so much about the world that way; much of what we learn about the world is not like that. It's just like: hey, this is something new, I never saw this before. This is a new arrangement of things, or this is a new behavior that somebody exhibited today.
A
Yes, but you wouldn't find every possible combination of three dots, or let's say four dots, against some background, or a pentagon. You have to kind of understand the relations in order to know it's an equilateral triangle. You have to understand the relations between those three dots, not so much the dots themselves explicitly. So it's, it's...
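As a worked example of relations mattering more than the dots themselves, here is a small sketch (my own illustration, not from the discussion) that recognizes an equilateral triangle purely from pairwise distances, so any translated or rotated copy of the same three dots still passes.

```python
# Sketch: recognize an equilateral triangle from the *relations* between
# three dots (their pairwise distances), not their absolute positions.
import math

def is_equilateral(p1, p2, p3, tol=1e-6):
    """True if the three points form an (approximately) equilateral triangle."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    d12, d23, d31 = dist(p1, p2), dist(p2, p3), dist(p3, p1)
    avg = (d12 + d23 + d31) / 3
    return avg > 0 and all(abs(d - avg) <= tol * avg for d in (d12, d23, d31))

# Same relations, different absolute positions: both are equilateral.
print(is_equilateral((0, 0), (1, 0), (0.5, math.sqrt(3) / 2)))      # True
print(is_equilateral((5, 5), (6, 5), (5.5, 5 + math.sqrt(3) / 2)))  # True
print(is_equilateral((0, 0), (2, 0), (1, 0.2)))                     # False
```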
C
You know, I don't know; I was trying to pick examples where the statistics of the models are not generally reliable, where things out there are changing all the time. There are different arrangements of things that people have never described in language, places people have never been, that kind of thing.
C
There's so much stuff we know that really doesn't exist in language at all. Again, that argument is not a fundamental argument; it's more of a practical argument about whether language is sufficient. But there is a huge number of things we know that we do not have words for, or maybe the words aren't commonly shared. Maybe someone knows what the little doohickey on that little bike part is called, but I don't.
C
You could, but that's my point. My point is that my knowledge of the world is not stored in language. My model of the world is stored in a model that I've created from my sensorimotor interactions, and I can apply language to that model. I can try to explain how I know what a stapler does, but I didn't learn about staplers through language. I understand a stapler by picking it up, opening it, flipping it around, and saying: hey, look, this little plate on the bottom can reverse and make the staples go this way versus that way.
C
Does everyone know that? I don't know. And I don't have a word for that thing; I don't even have a word for making the staples bend outward versus inward. There probably is a word, but I don't know it. The point is, there's so much I know about the world, learned through experience, that may never be reflected in language, and that's still not the way my brain works. My brain doesn't work on a list of language; I apply language to the models.
C
The models exist as some recreation, or a storage, of what I experience, and in our work here at Numenta we understand how that happens. Then I can say: okay, given that my model of a stapler exists, I can try to apply words to it. I can ask, well, how would I describe that part, and how would I describe this action? But the knowledge about the action and the knowledge of the parts doesn't exist in words. It's something I apply language to later, yeah.
A
I want to distinguish between what I think are two different things. One is the modality of the sense: we have, say, language versus vision versus audition and so on. But regardless of the modality, there's an independent question, which is whether learning is purely passive, through statistics, or an active sensorimotor loop. I think those are two orthogonal things.
C
We know, by the way, for certain that the brain learns through sensorimotor interactions; I mean, that's not debatable. And we also know how it learns. We know that it uses reference frames for storing knowledge: as you interact with the world, the brain keeps track of where everything is in a reference frame, very much like an engineering CAD program or something like that. That's what's going on in the brain. So knowledge is stored in reference frames, and our models are stored in reference frames.
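Here is a very rough sketch of storing knowledge in an object-centric reference frame. This is purely my own illustration of the idea, not Numenta's actual model; the ReferenceFrame class and its locations and features are invented for the example.

```python
# Rough sketch: knowledge stored in an object-centric reference frame,
# i.e., a map from locations *on the object* to features, independent of
# where the object sits in the room. Illustrative only.

class ReferenceFrame:
    def __init__(self):
        self.features = {}  # location (x, y, z) -> feature name

    def store(self, location, feature):
        self.features[location] = feature

    def predict(self, location):
        """What feature do we expect if we move our sensor to `location`?"""
        return self.features.get(location, "unknown")

# Build a model of a stapler by sensing features at object-relative locations.
stapler = ReferenceFrame()
stapler.store((0, 0, 0), "hinge")
stapler.store((5, 0, 0), "staple outlet")
stapler.store((5, 0, -1), "reversible plate")

# The same predictions hold wherever the stapler is on the table,
# because locations are relative to the object, not to the room.
print(stapler.predict((5, 0, -1)))  # -> "reversible plate"
```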
C
Now the question is: can you get to AGI with a system that doesn't have that? You know, I don't think so. I really don't; I just don't think you can. You can have a lot of impressive stuff, but, as you point out, there are going to be limits to these things. You know, we finished a paragraph from Hofstadter; it's pretty cool. But what's the next paragraph? What's the next book that Hofstadter is going to write? So, you know, your first and second...
B
The whole language system was built on top of that. Language comes from the social side, from trying to expand from individual experience to a social experience, and it doesn't necessarily represent everything. Because it's our shared social mechanism, we can sometimes think it represents more than is actually going on. So it's really interesting to see the evolution of language and how that fits in.
C
It's also interesting because with all these AI systems today, almost every one of them, we interact through language. I mean, if you say, okay, label this picture, well, it gives a word for it, but that word is impoverished. If it says this picture is a cat, does it know that there are cat people and dog people? Does it know that the cat has a heart? Does it know the cat's claws need to be clipped or it'll ruin your furniture?
H
Yeah, but then the second question is: can you generalize to out-of-domain distributions, things you've never seen before? And maybe the answer is that they can't, but the solution they're proposing is: okay, if I can't generalize out of domain, maybe I just put everything in domain. I'm just going to train on everything that's out there, and then nothing's going to be out of domain, right? So...
C
I think the focus on generalization is a little bit of a red herring. I think it's more important to focus on the dynamic learning aspect: learning new things rapidly that didn't exist before, and building models of things that haven't been seen before. Because if you have enough data, you might be able to generalize from it; you might be able to say, here are six trillion things, which one of these applies?
C
There are so many dimensions we can go down here. Let's take the whole dimension of embodiment, right? We physically manipulate the world: we pick things up, we do things, we move around, and so on. How does this apply to that? How do you make a robotic system that can go around and figure out how to put the chain back on the bicycle when it's never been shown that before? One that physically can do that, that can ride along and go, oh, the chain fell off.
C
Even robots today can't do this stuff at all. Now, I'm not saying you can't simulate what humans do; again, I think we can build intelligent robotic systems. What I guess I'm saying is, they can't deal with... you know, you're asking: could it learn to do things just by being shown every possible thing that ever happened?
C
Could it learn to do stuff like that? That's where I gave the example of the plates on the dinner table earlier, because there are no statistics in the world that tell me how the plates are arranged on my dinner table right now. Right now, not in general, not typically, but where the potatoes are on my plate, on my table, right now. So much of what we do in the world is about right now. It's like, I'm sitting here, my coffee is over here,
C
my mouse is over here, and maybe they'll be different in a minute. That kind of interaction with the world is not... I have a model right now in my head of where all these things are, and I can't get that from statistics. You know, I had to learn it really quickly, rapidly, just a second ago. Does that make sense? Yeah.