Numenta Research Meetings, 30 Sep 2021

From YouTube: Cortical Column Networks: Learning object identity and pose representations from pixel observations

Description

Drawing inspiration from the Thousand Brains Theory of Intelligence, guest speakers Tim Verbelen and Toon Van de Maele from Ghent University share their recent work on learning object identity and pose representations from pixel observations.

0:00 Introduction
3:42 Active Inference
18:40 Visual Foraging
26:40 Cortical Column Networks
44:28 Q&A

➤ Paper - https://arxiv.org/abs/2108.11762
➤ Blog post - https://thesmartrobot.github.io/2021/08/26/thousand-brains.html
➤ For more information on The Smart Robot: https://thesmartrobot.github.io/

Abstract
Although modern object detection and classification models achieve high accuracy, these are typically constrained in advance on a fixed train set and are therefore not flexible enough to deal with novel, unseen object categories. Moreover, these models most often operate on a single frame, which may yield incorrect classifications in case of ambiguous viewpoints. In this paper, we propose an active inference agent that actively gathers evidence for object classifications, and can learn novel object categories over time. Drawing inspiration from the Thousand Brains Theory of Intelligence, we build object-centric generative models composed of two information streams, a what- and a where-stream. The what-stream predicts whether the observed object belongs to a specific category, while the where-stream is responsible for representing the object in its internal 3D reference frame. In this talk, we will present our models and some initial results both in simulation and on a real-world robot.

Bio
Tim Verbelen received his M.Sc. and Ph.D. degrees in Computer Science Engineering at Ghent University in 2009 and 2013 respectively. Since then, he has been working as a senior researcher for Ghent University and imec. His main research interests include perception and control for autonomous systems using deep learning techniques and high-dimensional sensors such as camera, lidar and radar. In particular, he is active in the domain of representation learning and reinforcement learning, inspired by cognitive neuroscience theories such as active inference.

Toon Van de Maele received his M.Sc. degree in Computer Science Engineering at Ghent University in June 2019. Since then, he has been working on a Ph.D. degree on learning representations for 3D scenes at Ghent University. His main interest lies in the combination of deep learning approaches for robotic perception, using biologically-inspired techniques.

A

The work that we're going to present today is on learning object representations from pixel data. As Subutai already said, we're from Ghent University, a small group in Belgium, but we're also affiliated with imec, which is probably a bit more recognized.

A

So imec is a research center on nanoelectronics, where they do a lot of semiconductor research. We work especially with the hardware teams that develop specialized chips for AI workloads, which, as you all know, is synonymous these days with matrix multiplications, and also with their sensors team, so we also do some work on high-frequency radar. But our team is more focused on the learning part, and basically what we want to do is build intelligent agents.

A

Early on we realized that if you want to build something intelligent, then there has to be embodiment, so we quickly switched from doing just software to also starting up a robotics lab. These are some pictures of the lab. We have some robots that can drive around or fly around, which are more focused on navigation, and we also have some robotic arms that can move around with an in-hand camera in their confined workspace.

A

So these are the types of systems that we work with.

B

A

Regardless this is a big lab.

B

It's a big lab. It looks like you're in a very big warehouse. That's really nice! You have a lot of space there.

A

Yeah, so actually it was designed as a data center. The room next door is used for the whole university data center, and this room was also commissioned to be a data center, but as long as they don't have enough servers to fill this room, we can put our robots here.

C

A

So in particular, we work on one approach for our intelligent agents that is also inspired by neuroscience, which is active inference. In this talk I will first highlight what it is and how we implement it on our systems. Then, for the second part, Toon will talk about his research, where he applies these principles to robotic manipulation.

A

So that is on a robot arm, which resulted in an agent that was able to do some visual foraging in the workspace.

A

But then we realized that there were some shortcomings, and we ended up with the third part of the presentation, which is what we call the cortical column networks, and which is heavily inspired by the work that will be well known to you. So let's kick off with the first part. If there are questions, feel free to just interrupt me or flag it. Active inference is basically a process theory of the brain.

A

So, unlike what you are doing, it does not look inside the neocortex to figure out how everything works internally. It's more like: if we have a brain, what would the thing as a whole be optimizing? It proposes a potential optimization scheme that the brain follows.

A

In short, it goes like this. Your brain, or an intelligent agent, builds a generative model of its environment, where the generative model is depicted as the joint probability distribution over sensory observations or outcomes o, actions a and some hidden states s. So basically, your agent resides in its environment.

A

It can do some actions on this environment, which will result in some sensory outcomes at the next time step. What the agent tries to do is come up with some internal state representation s that should represent the hidden causes that give you the sensory outcomes, and that models how actions influence the environment and how these states then give rise to your sensory outcomes. The optimization of the generative model is then done by minimizing the so-called free energy.

A

This is basically an upper bound on your surprise, or prediction error, but we'll delve into the more rigorous math on the next slide. What is crucial in the whole active inference idea, I think, is that you not only use this objective to learn or train your generative model, but you also use it to select your future actions.

A

So basically, the principle says your agent not only minimizes free energy for updating the model, but it will also select actions that the agent thinks will minimize the expected free energy in the future. This is a nice characteristic, I think, because it basically brings perception, learning and action under one objective.

A

So how does it work mathematically? The free energy is defined as the expectation of the difference between two log-likelihoods. The first term is a variational approximate posterior distribution, and the second term is the generative model, so the likelihood of the joint distribution.

A

So what's happening here? Well, what you want to do as an agent is not only have a generative model so that you can predict what's going to happen, but you also want to infer, from the past sensory inputs you got, which state you're currently in. The problem is that even if you have a perfect generative model, getting this posterior is typically intractable, because if you use Bayes' rule you would have to integrate over all possible outcomes, which is very hard to do and just impossible in general.

A

So the idea, and this technique is called variational inference, which is probably well known to the machine learning people among you, says: instead of trying to get this true posterior distribution, I just replace it with a simple distribution q over the states and optimize that. I make sure this is something very simple that is easy to optimize. For example, you make it Gaussian, and you just try to figure out the means and standard deviations that bring it close to this true posterior.

A

What you can prove is that if you minimize this expectation term, it boils down to maximizing the log evidence for your model while, at the same time, minimizing the KL divergence between this approximate posterior and the true posterior. So if you have a machine learning background, this is basically known, up to a sign, as the evidence lower bound.

A

This is what is used in variational autoencoders, for example, so it's basically the same thing. What is nice is that it decomposes into two terms. The first term says: if I minimize my free energy, I minimize the KL divergence between this approximate posterior and what we call the prior. The prior is basically the states that I think I will visit if I only know my actions. You close your eyes and think about which states you will visit if you do these actions.

A

You want these states to be as close as possible to what you would estimate if you had your observations. So you basically want a good model both to predict and to infer the state.

A

In order to make sure that there is information in your states, you also have the second term, which is a kind of reconstruction term. Given a state, you try to predict which sensory inputs you will see. So you constantly try to predict your next sensory input.
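Written out, the free energy described in this part of the talk can be sketched as follows. The single-time-step notation is an assumption made here for readability and is not taken from the speakers' slides: o is the observation, s the hidden state, a the action, q the approximate posterior and p the generative model, with the observation assumed to depend only on the current state.

```latex
\begin{aligned}
F &= \mathbb{E}_{q(s \mid o, a)}\big[\log q(s \mid o, a) - \log p(o, s \mid a)\big] \\
  &= D_{\mathrm{KL}}\big[q(s \mid o, a) \,\|\, p(s \mid o, a)\big] - \log p(o \mid a) \\
  &= D_{\mathrm{KL}}\big[q(s \mid o, a) \,\|\, p(s \mid a)\big] - \mathbb{E}_{q(s \mid o, a)}\big[\log p(o \mid s)\big]
\end{aligned}
```

The second line shows why F is an upper bound on the surprise, since the KL term is never negative. The third line is the decomposition into a prior-matching term and a reconstruction term.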

A

As I said, for the action selection you also want to minimize your free energy, but now the free energy for future time steps, and then two things change. The first is that in the expectation you don't know your observations yet. For the past you always saw some observations, but you don't know what's going to happen in the future, so you also take an expectation over whatever outcomes or sensory data you might expect.

A

The second thing is that you also didn't act yet, so you still have to choose your actions. Everything is now conditioned on the policy pi, which is basically just shorthand for the actions you will do in the future.

A

Because you didn't act yet, you don't know which one it will be, so it becomes a condition. Then, in the second line of the equation, we make two assumptions to make this practical. The first one is that we assume that in the future you have some prior expectations of what you will see. This is basically saying: I have some preferences, and I assume that I will see these things because that's my preference. I assume I will reach my goals. That's what it says.

A

You can also interpret it as more of a homeostasis kind of thing, where you say: I expect my body temperature to be around 36 degrees Celsius. So, regardless of what my actions are, I always assume it will be this way, and by putting it like this, it will force or encourage your agent to look for actions that actually visit that state, so it becomes a kind of self-fulfilling prophecy. We call this the instrumental value, or realizing your preferences.

A

You can also cast it as rewards: in a reinforcement learning agent, this would typically be your reward signal.

A

But, interestingly, there's a second term, and here we make our second assumption: we assume that our approximate posterior is actually pretty close to the true posterior. So we just assume that we did a good job at modeling the world, and then you can replace p with q in the second term, and what arises is basically an information gain term. You want to visit states where, after the expected sensory outcomes, you will know more about your state.

A

This drives your agent to explore, or to gather information about the environment, which we call the epistemic value. So, enough with the math.
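The expected free energy and the two assumptions just described can be sketched as below. The notation is an assumption made here, not copied from the slides: pi is the policy, p(o) encodes the prior preferences over future outcomes, and q(s | o, pi) stands in for the true posterior.

```latex
\begin{aligned}
G(\pi) &= \mathbb{E}_{q(o, s \mid \pi)}\big[\log q(s \mid \pi) - \log p(o, s \mid \pi)\big] \\
       &\approx \underbrace{-\,\mathbb{E}_{q(o \mid \pi)}\big[\log p(o)\big]}_{\text{instrumental value}}
         \;\underbrace{-\,\mathbb{E}_{q(o \mid \pi)}\Big[D_{\mathrm{KL}}\big[q(s \mid o, \pi) \,\|\, q(s \mid \pi)\big]\Big]}_{\text{epistemic value}}
\end{aligned}
```

Minimizing G therefore favors policies whose expected outcomes match the preferences and whose expected observations change the belief over states the most.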

A

What happens if we actually want to implement such a thing in the real world? Basically, we instantiate three artificial neural networks, which we call the encoder, the decoder and the transition model. The encoder takes as input your previous state, the action that you did and your sensory inputs, and it needs to infer the state distribution.

A

So it's a kind of probabilistic state representation saying which state you are in now. The transition model similarly takes as input your current state and your action, but it lacks the observation, so it doesn't know the sensory input.

A

It just predicts: if I do this action, this is the state I will end up in. So it's what you use for selecting the best actions possible: in which state will I end up if I do this or that?

A

Finally, the decoder predicts the sensory observations that you see in particular states. You train this by interacting with the world, minimizing the prediction error of your decoder and minimizing the KL divergence between what you thought would happen according to the transition model and what you think has happened after observing your sensory outcomes.
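The training objective just described can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the speakers' implementation: the encoder and transition model are assumed to output diagonal Gaussians as (means, standard deviations), the prediction error is assumed to be a squared error, and all function names are hypothetical.

```python
import math


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL[q || p] between two diagonal Gaussians (lists of means and std devs)."""
    kl = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        kl += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return kl


def squared_error(observation, reconstruction):
    """Prediction error of the decoder (assumed form: sum of squared differences)."""
    return sum((o - r) ** 2 for o, r in zip(observation, reconstruction))


def free_energy_loss(posterior, prior, observation, reconstruction):
    """Loss for one time step.

    posterior: (means, stds) from the encoder, which has seen the observation.
    prior:     (means, stds) from the transition model, which has not.
    """
    mu_q, sigma_q = posterior
    mu_p, sigma_p = prior
    complexity = kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p)
    accuracy = squared_error(observation, reconstruction)
    return complexity + accuracy


# Hypothetical two-dimensional state and three-pixel observation.
loss = free_energy_loss(
    posterior=([0.0, 1.0], [1.0, 1.0]),
    prior=([1.0, 1.0], [1.0, 1.0]),
    observation=[0.2, 0.4, 0.6],
    reconstruction=[0.2, 0.4, 0.6],
)
print(loss)
```

In a real system the three networks would be trained by gradient descent on this quantity summed over time steps; the sketch only shows how the two terms combine.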

A

In some prior work, we tested this on a number of environments. The first environment, top left, is the mountain car. You have this cart, which you can thrust left or right, and the goal is to reach the top of the mountain. But the only way to reach it is to first get some momentum, because you don't have enough thrust to get up the mountain right away. So it's a hard exploration problem in RL.

A

What you see is that if we train using some data and fit the generative model, it learns to predict the outcome. Here on the right you see in blue what you expect to happen if you thrust right, and in red what you expect to happen if you first thrust left and then thrust to the right. You can see that early on, it does not know which momentum it has.

A

It gets the position, but it doesn't know in which direction it is moving. So even if it thrusts to the right it might expect to get there, but most often it thinks: I will not get enough momentum to go up the mountain.

A

However, if I first go to the left, these are the red curves, then probably I will end up there at some point. You can see how the beliefs shift: at some point, if it is far enough to the left, it decides that if it now thrusts to the right, it will always reach the top. So then it switches its behavior to thrusting to the right.

A

The second environment that we tested has some more complex inputs: the car racer environment. Here you have a top-down view of a race car and you have to race on the track. We now train the model where the observations are the raw pixels of the game, and you can still steer left or steer right.

A

And then you can see that it actually learns to race the track. Remember that I talked about these preferred states. Here we say the preferred state is to be in the middle of the track, and that's why you see the behavior that, if there's a corner, it tends to cut the corner, because that's the way to get to the preferred state, which in this case is being in the middle of the track rather than getting a high score or something. Finally, we also tested this on real-world robot data.

A

We had one of our robots drive around in our lab, and what you see here is the robot imagining what will happen if it keeps doing the same action. You can see it's a bit more blurred, but for all the sensor modalities it makes sensible predictions of the robot turning inside one of these small aisles. You can see the camera modality, the lidar scan, and this is a range-Doppler map coming from a radar. So with this, we are at the point where we have introduced our active inference methods.

A

We showed that it works on a number of environments, and now we switch to one particular environment, which is our robotic manipulator, and I'll hand over to Toon.

E

All right, is everybody able to see the screen now? Right. So in this particular environment, we want to apply the same principles Tim just talked about to a setup with a robot manipulator.

E

In this case we have a robot manipulator in simulation on a workbench, and the goal is again to learn a model of this environment in which the robot can act. The environment is set up with a manipulator with an in-hand camera, so a camera is mounted on the gripper, and the agent is the robot. By actuating the robot, the agent can move the camera around and observe the environment from different viewpoints.

E

So the goal here is to learn a model of these environments, so that the agent knows where each object is and, by having a preference for a certain object, knows for example where it should grasp it.

E

In this simple environment, we have a table, and the objects can be a cube, a bar or a sphere, at random positions, in random orientations and in random colors.

E

Similar to previously, we have observations o, which in this case are the pixels of the in-hand camera. You can see them in small on the left of the figure. The actions here are the absolute viewpoints of the robot, so the camera viewpoints: the action is moving the camera to a position in absolute coordinates. The state space is a 256-dimensional learned Gaussian distribution with a diagonal covariance matrix.

B

So a quick question there. When you say the action is the absolute viewpoint of the robot, which reference frame is that? You're saying you're giving a position in some reference frame. What reference frame is that in?

E

Yeah, an absolute reference frame, so with respect to the robot base.

B

Okay, so it's not in the camera reference frame, it's the actual external reference frame, so implicitly it will have to do some sort of transformation.

E

Yeah, exactly.

F

Also, why is it called an action? I mean, how is the viewpoint an action? I'm missing that. Is it like where I want the viewpoint to be, and then the arm goes to that position? Okay, so it's not an action in terms of flexing or extension or rotation. It's just saying: my action is to get to this viewpoint, and bingo, that kind of thing. Okay.

B

And does it know its own joint positions, or does it not know the joint angles or anything?

E

It does not care about the joint angles. It's just the position and orientation of the camera with respect to the robot base.

E

First of all, there is an encoder model, which encodes the observations and the viewpoints as pairs into this state space, which is a 256-dimensional Gaussian describing the global state of the environment, so the positions of the objects with respect to the global reference frame. These pairs are all encoded separately and then integrated with each other using a Kalman-like filter.

E

Once an estimate has been made about the environment, you can sample from this distribution and, by choosing a new action, that is, a viewpoint from which you want to observe the environment, you can decode it into an imagined view. This is shown to the right of the decoder, in the center. There is this imagined view where you can see a black sphere, a black square and a yellow bar, and this can be used, for example, to drive the instrumental value which Tim also talked about. If your preferred view is also this view, then this will have a large instrumental value, or this instrumental value will have a large weight in the free energy.

E

Secondly, the encoder is used again, because you can also re-encode this imagined view into the belief over the workspace. You then get a new distribution describing the environment, imagining that you would have visited this novel viewpoint. That is marked by the blue arrow here. You can then compute the epistemic value, or the expected information gain, from this, and so estimate how much you expect to learn by visiting this novel viewpoint.

E

That's basically the scheme of our first work using the robot manipulator.
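The "Kalman-like" integration of separately encoded views mentioned in this scheme is not spelled out in the talk. One common way to combine independent diagonal Gaussian estimates is a precision-weighted product, sketched below as an assumption; the function name and the example numbers are hypothetical.

```python
def fuse_diag_gaussians(mu_a, var_a, mu_b, var_b):
    """Fuse two diagonal Gaussian beliefs over the same state.

    Each belief is given as a list of means and a list of variances.
    The result is the normalized product of the two Gaussians: precisions
    add up, and the mean is the precision-weighted average.
    """
    mu, var = [], []
    for ma, va, mb, vb in zip(mu_a, var_a, mu_b, var_b):
        precision = 1.0 / va + 1.0 / vb
        var.append(1.0 / precision)
        mu.append((ma / va + mb / vb) / precision)
    return mu, var


# Start from a broad belief and fold in the estimates from two views.
belief = ([0.0], [100.0])
for view_estimate in [([2.0], [1.0]), ([4.0], [1.0])]:
    belief = fuse_diag_gaussians(*belief, *view_estimate)
print(belief)
```

Each added view can only shrink the variance, which matches the behavior described next: reconstructions get more accurate as more observations are integrated.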

E

So we can then learn 3D manipulation models from an in-hand camera. We have observations on the left and, on the right, reconstructions from the same viewpoints. Initially you can see that it's predicting gibberish, because it really doesn't know anything; it has observed only a single thing. But when more observations are added, it can predict the following ones more accurately, and as more information is added, the model becomes more accurate. For example, once it had observed this blue bar entirely, it started to reconstruct it correctly here.

E

This model can also be used for visual foraging, again by minimizing the expected free energy and balancing the epistemic and instrumental value. We can provide the agent with a preferred observation, such as this blue square here. Initially it does not know anything about the workspace, so when we drive actions, it will first go up to a high vantage point to observe the environment and acquire more information, and once it knows where the blue cube is, it can move towards it.

E

Secondly, afterwards we can use this model by querying a continuous space of views, and we can move around in the imagined space of the workspace.

E

This model still has a lot of limitations, because it uses a lot of data to train. We need 8000 000 different scenes of different configurations, where only five primitive objects are used.

E

This model uses recurrent information integration, which is a bit slower, and it also requires a long training time; we're speaking in terms of weeks of training for this model. Because of the data requirements, we can only really do it in simulation, and this does not transfer well to the real world, because the simulation is not realistic enough and we cannot accurately represent a real-world object.

E

So this is why we needed a different approach to how we represent these kinds of things, and that's how we got to what we call the cortical column networks, because we took a lot of inspiration from the Thousand Brains Theory.

E

We wanted to build an elementary sensorimotor structure, which we can train in a unified way and which we can replicate over and over. So if we need to learn more objects, we can add more copies of this same component.

E

This is similar to the cortical columns in the neocortex. We also want to model how the sensor will move with respect to the object, and so work in an object-centric reference frame rather than a global one, because it's more difficult to learn all possible configurations of the workspace than to learn with respect to the object.

E

Finally, we want to integrate information over time by a voting mechanism and not by a recurrent mechanism, because then the information of the past will not be forgotten as easily. In the end, we create a CCN, or cortical column network, structure, which takes as input a pixel-based observation and votes for a particular object identity and a sensor pose in an object-local reference frame.

E

But to do this, we needed to modify our setup, because it could not be done right away in the setup we had previously.

E

So we are now considering nine different objects from the YCB dataset, which is a dataset for robot manipulation, and we now have an environment in which we have a single object with identity i, in this case the sugar box.

E

We also have a simulated camera with viewpoint v_t, the viewpoint at time t, with a corresponding observation o_t. The actions are now a translation and a rotation of this camera viewpoint, and applying this relative transform to the viewpoint gives a new viewpoint v_{t+1}, again with a corresponding new observation o_{t+1}. We can also describe this in a generative model, which is necessary for using the active inference framework.
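On the simulator side, applying a relative translation and rotation to a camera viewpoint can be sketched with homogeneous 4x4 matrices. This is an illustration only: the matrix convention (camera-to-world pose, action expressed in the camera's own frame) and all function names are assumptions, and the agent itself never sees these explicit poses.

```python
import math


def matmul4(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def translation(x, y, z):
    """Homogeneous transform for a pure translation."""
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


def rotation_z(angle):
    """Homogeneous transform for a rotation about the z axis (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0, 0.0],
            [s, c, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]


def apply_action(viewpoint, action):
    """v_{t+1} = v_t composed with the relative transform a_t."""
    return matmul4(viewpoint, action)


# A camera rotated by 90 degrees about z that moves one unit along its own x axis.
v_t = rotation_z(math.pi / 2)
a_t = translation(1.0, 0.0, 0.0)
v_next = apply_action(v_t, a_t)
print([row[3] for row in v_next[:3]])  # new camera position in the outer frame
```

Because the action is composed on the right, the same action means the same motion relative to the camera regardless of where the camera currently is.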

E

In this generative model we have the observations o, which depend on the object identity, which does not change over time, so it's a variable that remains the same, and on the viewpoint of the camera, which does change over time, given an action. So the viewpoints depend on the previous viewpoint and the action applied to it. Once we have this generative model, we want to do inference on it.

E

So when we have an observation o_t, we want to know, for example, which object it is, so we want to infer the object identity given our observation, and we also want to infer the camera pose given our observation.

E

So how do we do this exactly in practice?

B

I have a question on the previous slide. It sounds to me like the viewpoint here is still in the camera's reference frame, not in the object's reference frame, the way it is described here. Is that correct?

E

Well, the viewpoint is in the object reference frame in the sense that the object is always in the center of the view, and because the transforms are relative, they are expressed in its internal reference frame, basically.

B

Okay, so the object is always at this given center location.

E

Yeah, but because the actions are all relative to the view, it...

B

Doesn't matter.

A

Yeah, exactly. Also, here the agent does not know the viewpoints per se, so it has to infer them, and these representations, as will become apparent on the next slide, will also be learned. So it no longer has an x, y, z and quaternion representation of the viewpoints.

A

It just builds a representation and learns how an action will shift it within this learned representation.

C

Yeah, and the action selection is random? The one that changes from one viewpoint to the next?

E

Well, the action selection is still based on active inference, but I will get to that later. For learning the model, initially it's random, but later, for inferring identity or for moving towards a target pose, it will be driven by the active inference free energy functional.

E

All right, now I'll go to the next slide. So how do we do it exactly? When we have an observation o_t of, for example, the cracker box in this case, we encode it through our encoder model, and this will output two different distributions.

E

First, it outputs the object identity, or "what", distribution, which is just a binary, Bernoulli-distributed variable. It should output one in the case that this cortical column network is dedicated to cracker boxes and zero in any other case. For example, if I input a mustard bottle, it should output zero.

E

Secondly, it describes the observed object pose, again in a state space of a Gaussian distribution with a diagonal covariance matrix. Please note that this is not an explicit representation. It's implicit in the sense that it is never mapped to explicit pose coordinates, so it's just a latent representation of this pose.

E

Then we can sample from this distribution to get a vector describing the object's pose, and this can be decoded back into an image, the expected observation o-hat_t.

E

Then we can apply an action to the sampled vector v-hat_t and transition it to acquire a new distribution over the pose, which lives in the same latent dimensions as the "where" distribution.

E

So this is a single CCN as we describe it, where we have the identity and the pose each in a separate distribution.

E

We optimize these using the free energy functional again. The first line equates to what Tim described earlier, but with the specific generative model I described a few slides back. In essence it comes down to three terms: the first one focusing on object classification, the second one on the pose estimation, and finally a reconstruction error on the observation.

E

These CCNs are then used for voting on object identity, and this is done because each CCN only has a single output variable.

E

When we acquire a new observation, we pass it through each of the different CCNs and acquire a prediction from each one. Then, to convert this to a distribution, we push these through a softmax function and acquire a categorical distribution. In this case the final one is the CCN for the mug, so this should be the largest value.
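The conversion from per-CCN outputs to a categorical distribution can be sketched as follows. The object names and scores are made up for illustration, and treating the raw CCN outputs as softmax inputs is the only step taken from the talk.

```python
import math


def softmax(scores):
    """Turn a list of scores into a categorical distribution."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical outputs of four CCNs for one observation of a mug.
objects = ["cracker box", "sugar box", "mustard bottle", "mug"]
ccn_outputs = [0.05, 0.10, 0.02, 0.95]
probs = softmax(ccn_outputs)
best = objects[probs.index(max(probs))]
print(best)
```

With outputs confined to the range zero to one, a plain softmax stays fairly flat, so a single observation gives only a weak preference; that is consistent with the later remark that votes are accumulated over time.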

F

I missed a basic idea here: what is the difference between the different CCNs?

E

I might have forgotten to say it: each CCN focuses on a single object, so yeah, here.

F

So each one is trying to recognize a particular object. Then how would they vote? Because they're just basically competing with one another, in a sense.

E

Yes, exactly. Because each CCN is trying to identify its corresponding object, if it acquired an observation of a different object, it would output a small value in this distribution.

E

How do I say it: it should output a zero for observations that do not belong to its category.

F

Yeah, but then they really can't vote to reach an agreement, right? Because in some sense each is competing with the others.

E

Yes, exactly. The CCNs at a single time step are not necessarily voting. In our implementation they just cast a single vote for a single observation, but over time we accumulate votes to create a more accurate prediction.

F

This is not to be critical, but it's kind of fundamentally different from the way we've described cortical columns in the brain, which are all voting on the same sets of objects from different viewpoints. Here, you've dedicated a column to each object and are accumulating evidence over time for each one independently.

E

Yeah, I think you're right. We also mentioned this in the discussion: we took some inspiration from it, but we're not modeling the cortical column exactly.

F

That's fine, I just wanted to make sure I understood. Thank you.

B

Have you thought about doing something where each cortical column is looking at a different segment of the input space?

E

We have not yet done this. We did think about it, not necessarily with different parts of the camera observation, but with different sensor modalities.

B

Yeah, that could work too.

B

That would be more interesting in some sense.

E

Yeah, because now we assume that the entire retina, or the entire pixel observation, is actually a single sensor, right?

B

I may have missed something also with the encoder and decoder networks. What are they? Are they deep learning systems? You mentioned the output is Gaussian.

E

They're deep learning neural networks. The encoder is a convolutional neural network, the decoder is also a convolutional neural network, and the transition model is a feed-forward neural network. We get a Gaussian distribution by predicting the mean and the variance separately.
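As a rough illustration of a Gaussian output that is parameterized by a separately predicted mean and variance, the sketch below draws a sample from a diagonal Gaussian. Predicting the log-variance (rather than the variance directly) is an assumption made here for numerical convenience, and the numbers are placeholders for what a network head would output.

```python
import math
import random


def sample_diag_gaussian(mean, log_var, rng):
    """Draw mean + std * eps per dimension, with eps ~ N(0, 1)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, log_var)
    ]


rng = random.Random(0)
mean = [0.5, -1.0, 2.0]       # hypothetical output of a "mean" head
log_var = [-2.0, -2.0, -2.0]  # hypothetical output of a "log-variance" head
pose_sample = sample_diag_gaussian(mean, log_var, rng)
```

Writing the sample as the mean plus scaled noise keeps it a differentiable function of the predicted parameters, which is what allows this kind of model to be trained with backpropagation.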

B

And you trained the whole thing with backpropagation, as one full joint model?

E

Yeah, exactly, it's trained as a whole. All right. So this is how the CCNs make a single vote, or our type of vote. Then we want to aggregate information over time, that is, aggregate the predictions over time as separate votes. We do this because we now have a set of categorical distributions.

E

So we model it with a Dirichlet distribution as a conjugate prior, and the parameters of the Dirichlet distribution are basically the votes that are acquired over time.

E

Then, because we want to drive our agent to choose viewpoints that improve the classification accuracy as fast as possible, we choose them by minimizing the expected free energy G, which again unpacks into three terms. The first is the state preference. The second is the information gain on object identity: how much will the agent learn about object identity when performing a given action? The final one is the information gain on object pose. Because we now care about object identity, we look at exactly that identity term to drive our agent forward.
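The vote accumulation can be sketched as a simple update of Dirichlet concentration parameters. Adding each soft categorical vote directly to the parameters is an assumption for illustration, and the vote values are made up.

```python
def update_dirichlet(alpha, vote):
    """Add one categorical vote to the Dirichlet concentration parameters."""
    return [a + v for a, v in zip(alpha, vote)]


def expected_categorical(alpha):
    """Mean of the Dirichlet: the current belief over object identity."""
    total = sum(alpha)
    return [a / total for a in alpha]


alpha = [1.0, 1.0, 1.0]  # uniform prior over three hypothetical objects
votes = [
    [0.2, 0.3, 0.5],  # categorical distribution from the first view
    [0.1, 0.2, 0.7],  # second view
    [0.1, 0.1, 0.8],  # third view
]
for vote in votes:
    alpha = update_dirichlet(alpha, vote)

belief = expected_categorical(alpha)
```

Because the Dirichlet is conjugate to the categorical distribution, the belief after any number of views stays in the same family and only the parameters grow.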

E

Now we can see that by choosing more views, the accuracy of the predictions improves over time compared to an agent that does not move. More interestingly, we also look at the views that the agent actually selects. For example, for this blue can, it chooses a view where the can is clearly visible and where it can clearly differentiate the logo on it.

E

On the right, you can see the view that it most certainly did not select, the view with the highest free energy, and that's just the top or the bottom of the can. It could be a soup can, a Pringles can, or a Master Chef can, so it's very ambiguous, and this is clearly reflected in these observations.

E

A second example is on the bottom, where you have the blue markings on the mustard bottle, while on the right there is just a yellow blob, which could be a banana or something else.

E

As a second use case, we also use these models to drive our agent towards a preferred pose using active inference. For a desired pose, for example for grasping, we provide a visual of an object, and the agent then finds the action that will move the camera to view that exact visual.

E

So again, we minimize the free energy, but this time with respect to the action, or the relative transform on the camera pose. Now we look at the first term, which is the state preference on the object observation. In the top row you can see the targets. In the second row you can see some random movements in imagined space, and in the bottom row we show the trajectory chosen by the active inference agent, which always moves the object into the target position.

E

So in the discussion: we now have separate learning of what and where from pixel-based observations in an object-centric reference frame. We base ourselves on the principles of the active inference framework and the Thousand Brains Theory.

E

However, it's clear that these are not cortical columns, as we operate on the complete sensory input, as we discussed earlier, and each CCN focuses on a single object category rather than modeling multiple categories and separating over different sensory inputs. So it might be more like a minicolumn, but I might be wrong here.

F

I think, technically, that's probably not correct, but yeah.

E

Right. We also do not have a hierarchical, distributed, or sparse representation. It's more of a dense representation in our case. Our future ideas are to put these objects in the actual robotic workspace. Then, if the object is recognized, we can use it to do a certain task as preferred, and if it's not, we can inspect it, capture data, and train a new CCN in a more continual-learning kind of setting.

E

So this was the presentation. Thank you for listening.

A

That's great, thank you.

B

I'm curious about the static versus the dynamic chart that you showed. How did you do the static system?

E

Which static system do you mean?

B

Oh, you had one where it's allowed to move and one where it's not allowed to move. You showed the accuracy charts over the number of views.

E

Oh yeah. We just initially picked a random pose, so for the static agent the first step is exactly the same, as you can see. We picked a random initial observation for the nine objects in, I forgot how many settings, but we sampled a lot of different settings to acquire these bands.

B

Still, the accuracy is pretty high for the static agent, at 92.

B

I wonder if it would be different if the objects were more similar in some sense. Here, with color, you could probably tell a lot of them apart, and maybe if the objects were much more confusable there might be an even greater separation, potentially.

E

Yeah, that's exactly right, and that's something we also plan to investigate further.

B

Like a confusion matrix to see what the errors are.

B

That might be interesting.

F

That's true. In some sense, by picking those objects, you might have made the problem too easy.

E

Yeah, that's true.

D

You said previously that the training took weeks. How is it now with the new setup? Is it faster to learn now?

E

In this current setup, I think one CCN trains in about three or four hours on the same hardware. It requires way less data because you don't have to learn the global reference frame, since it's all in the same latent space.

C

Now, in the last experiment, how do you define the preferred pose, for example for grasping?

E

This is done by providing a pixel-based observation. We provide the observation of an object in a given pose and say: okay, now encode this. Then you get the distribution over the pose, and the agent will find actions that minimize the distance to this.

A

So basically we sample a lot of actions, see which ones match best, and then refine.
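The sample, match, and refine loop can be sketched roughly as below. Everything concrete here is an assumption for illustration: a two-dimensional latent pose, an additive `transition` function standing in for the learned transition model, Euclidean distance as the matching score, and Gaussian perturbations with a shrinking scale as the refinement step.

```python
import random


def transition(pose, action):
    """Stand-in for the learned transition model (purely additive here)."""
    return [p + a for p, a in zip(pose, action)]


def distance(a, b):
    """Euclidean distance between two latent pose vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def find_action(pose, target, rng, n_samples=200, n_rounds=5):
    """Sample many actions, keep the best match, then refine around it."""
    dim = len(pose)
    candidates = [
        [rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_samples)
    ]
    best = min(candidates, key=lambda a: distance(transition(pose, a), target))
    scale = 0.3
    for _ in range(n_rounds):
        candidates = [
            [b + rng.gauss(0.0, scale) for b in best] for _ in range(n_samples)
        ]
        candidates.append(best)  # never lose the best action found so far
        best = min(candidates, key=lambda a: distance(transition(pose, a), target))
        scale *= 0.5  # search more locally in each round
    return best


rng = random.Random(0)
current_pose = [0.0, 0.0]
target_pose = [0.7, -0.4]  # in the talk, this would come from encoding the target image
action = find_action(current_pose, target_pose, rng)
final_error = distance(transition(current_pose, action), target_pose)
```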

D

And did you also test the previous setup on this task, with a single object, just to compare?

E

Yeah, we did not do this because we have a completely different environment. I guess we could have trained the old model using absolute viewpoints as well, but we did not compare this.

A

But then it would take four weeks again.

F

One of the things I found really interesting about this talk is something we're working on, or thinking about, too: you have a sort of detailed biological model, so which aspects do you maintain and which don't you maintain as you're trying to build something practical?

F

So it's interesting to just observe your choices, some of which are not like the brain at all, right? But that's okay. It's an interesting question that we have to deal with in the long term. This is just a very high-level observation. It was useful to me to see what you're doing, because you've made a certain set of choices.

F

You come from the free energy principle world and you'd like to keep that, so you're trying to fit that with the Thousand Brains Theory. Other people might come from a different direction and try to fit that with the Thousand Brains Theory. If you come from a purely biological direction, you might not do any of those things. Not that one is right or wrong. It's just interesting.

F

So to me, seeing this was a very useful exercise, which I appreciate.

A

I think we're also looking at it from our side: given what we have now, how might we move even more in the direction of biology, to check whether it might work even better? We're discussing a lot internally whether we should look more into the neuron model that you guys propose, which has some nice characteristics for learning.

F

We're doing the same.

F

We're doing the same thing here right now, so yeah.

B

We're trying to figure that out too.

F

So yeah, it's just pretty fascinating. I think we're at sort of the forefront of things that I think are going to be pretty comprehensive, pretty essential in the future of robotics and AI, and we're discovering and trying to figure out what's important and what's not.

B

If I understand the way you're doing it, if the object is on its side or something, so the object itself changes, it doesn't matter at all, because it's just completely in the reference frame of the object.

G

The way you're doing it right now.

B

Is there really no notion of the object being on its side?

B

Going back to some of the stuff we were discussing earlier: have you thought about keeping the same setup but allowing each cortical column to model multiple objects, and what impact that might have?

B

Because now I guess you might have a scaling issue: as you add more and more objects, you need more cortical columns. But if you were able to keep a fixed set of cortical columns and allow each cortical column to model multiple objects, there might be a scaling benefit.

A

Yeah, for sure. Suppose you have really a lot of objects. Just to name numbers, say you have ten thousand objects and a thousand cortical columns, or cortical column networks, and each one of them learns about, let's say, 20 objects. Then it would be more like casting votes as in the Thousand Brains Theory, because each cortical column that knows about the object among its 20 will cast a vote for that object.

A

But then the exact bits that turn to one among the thousand votes are more like a sparse representation of that particular object.

A

Then we'll have to scale it to a lot of objects. So that's the challenge.

F

That's a challenge because of the time it takes to train?

A

Well yeah, and we'd probably first need 10,000 objects that we can render to generate data from.

F

Oh, I don't know why you need so many objects. It seems like you could do it with a limited set of objects. I think the idea here is to distribute the learning of the objects over multiple columns.

F

Each column could learn a different set of objects, sub-sampling from the entire set, so each one has a limited view of the world. I know 10, you know 10, someone else knows 10, and the 10 we each know are different. But it doesn't seem like you have to have thousands of objects to make that interesting.

F

You could have 100 objects and it would still be pretty interesting.

B

Yeah, I mean this benchmark has something like 100 or 180 objects, I think, because we looked at the same benchmark before, so you could try that. Some of them are confusing, quite similar to one another, so that could be a good test case.

A

Yeah, another question would then be: how do you choose which objects get assigned to which column? You could do it randomly, or use...

F

Well, in the brain it's not random. In the brain it has to do with what they're connected to, which part of the sensory array they connect to. But in a non-brain-like way you could just randomly sub-sample. Say you have 100 objects and 50 columns, or whatever.

F

You could just say each column is going to randomly subsample from the 100 objects and learn some number of them, and it would still be a very interesting exercise. In the brain it wouldn't be like that: some columns are vision columns, some are hearing columns, some are touch columns, and they represent different parts of the sensory array. But you don't need to do that here. It would still be a very interesting exercise.
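The random sub-sampling suggested here could look like the sketch below. The 100 objects and 50 columns come from the discussion. The choice of 10 objects per column, the object names, and the helper functions are hypothetical.

```python
import random


def assign_objects(objects, n_columns, per_column, rng):
    """Give every column its own random subset of the object set."""
    return [rng.sample(objects, per_column) for _ in range(n_columns)]


def voters_for(obj, assignment):
    """Indices of the columns that know this object and could vote for it."""
    return [i for i, known in enumerate(assignment) if obj in known]


rng = random.Random(0)
objects = [f"object_{i}" for i in range(100)]
assignment = assign_objects(objects, n_columns=50, per_column=10, rng=rng)
voters = voters_for("object_7", assignment)
```

With these numbers each object is known to five columns on average, so the set of columns voting for a given object differs from object to object.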

F

We were just talking here yesterday about how we could introduce the concept of voting into machine learning systems that are not fully brain-like. That's what we're talking about right now: you could say, okay, we have multiple modeling agents, in some sense all equivalent, but they learn different subsets of the objects in the world. Then you get voting in the sense that everyone who knows about the coffee can, or whatever it is, gets to vote.

F

But you have this sort of interbreeding of votes, because the models are randomly distributed. It would be very interesting. I don't think it would be a lot of work. I'm telling you what to do.

F

That would be an interesting direction to go.

A

Yeah, the tricky part, I think, in our setup as it is right now is that the reconstruction decoder trains very fast and very well because it really just knows this one object. Whereas if every column encoded a multitude of them, it would also have to encode somehow which particular object of its set it should reconstruct in order to be correct.

A

So it will have a harder time decoding, and it will need another latent space, or some part of the viewpoint latent space that we use now, to encode this. So yeah, definitely.

F

Well, that gets into the details of how you're doing this. The free energy principle stuff, I'm not familiar with it in detail. But another thought, and I don't know if this applies to the way you're doing it: if I'm distributing the model of the chips can onto some subset of the columns, then those models could somehow be inferior, that is, they could have lower resolution.

F

They don't have to be as good individually, right? Just like some of the visual columns in your cortex look at very large parts of the input space on the retina, and those are going to be very fuzzy and not very detailed, but they're still useful. In that sense you might be able to make a bunch of columns that are simpler and take less time to train or use fewer resources.

F

I don't understand the way you're modeling columns well enough to know if that's possible, and I see what you're saying, that it could be difficult to model multiple things, but that would be a definite way you'd want to go. Or at least it's the way we would want to go if we were going to work on this: get the columns to model multiple things. They don't have to model a lot, just a few things, but you have to get it to work that way somehow.

F

I don't know if that's possible with the way you're doing it, though.

A

I think so, yes. It's definitely worth investigating. It sounds very interesting as well.

B

I mean, another possibility, and this might be closer to what a cortical column might do in biology, is that it becomes more of a generic encoder for visual images. It picks up on the visual features that cut across all categories, and then you can use that representation to decode again.

D

It might not be completely accurate.

B

But it'll still be close enough. It should be able to reconstruct any visual image at that point.

B

That's another kind of direction to go in.

A

Yeah, that's also something that we thought about, because now every column has to learn some basic things over and over again. It would be good to get some more shared parts in this thing.

G

One thing I realized when looking through the paper: oh, these are like capsules, where each capsule stands for one object. That's not what our columns do. Capsules do: one capsule per object class. And then I saw that you pointed out the same thing in the paper, but I'm just drawing that connection in case people didn't notice.

F

I didn't understand that about capsules. It's interesting.

G

Yeah, they do that, and Geoff Hinton made the same connection to minicolumns: he thinks of a capsule as being like a minicolumn.

F

Well, I think that's absolutely not true. That's someone speaking without knowing what minicolumns actually look like. But it's okay to say that inside of a, I don't know, you could say there's a subset doing something.

G

You could think of this as being like capsules, but with movement and a sort of voting that kind of resembles our voting.

F

And that makes it interesting to think of it that way. We've looked at capsules and said: well, okay, that's pretty far removed from the Thousand Brains Theory, but there are some things in common, like they're trying to capture some sort of relative position of things. But the capsules have no concept of motion at all, right? And you've got some concept of motion here. So these are also intermediate things.

F

It's pretty interesting.

A

Yeah, also, as far as I know, the capsule does not have the concept of a reference frame. Every capsule learns about some features in the 2D image, and there is some sense of where it is in the 2D image, but it has no sense of "I'm in this 3D space where I can look around."

G

That's fair. I think the original paper planned to do that, but then the later capsules paper, I think you're totally right, got away from that.

C

I have a question about voting. Right now, each cortical column represents one object, and you're voting by taking a softmax over a Bernoulli variable. How are you going to vote when each column represents multiple objects?

C

Because then you can no longer do that, right?

A

Yeah, so how I would see it: every column represents multiple objects, and instead of doing the softmax you would just take the output of the Bernoulli, and maybe even clip it to zero or one. Then you take the vector over all your columns. Suppose you have 100 columns, and every column has a zero or one output.

A

You concatenate these into a vector, and hopefully this will be a sparse representation with only a few ones, for the columns that know about the object you're seeing.
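A small sketch of the binary vote vector described here, with 100 columns as in the example. The 0.5 threshold and the output values are assumptions chosen for illustration.

```python
def column_votes(bernoulli_outputs, threshold=0.5):
    """Clip each column's Bernoulli output to a 0/1 vote."""
    return [1 if p > threshold else 0 for p in bernoulli_outputs]


# Hypothetical outputs of 100 columns: only three of them know the object.
outputs = [0.02] * 100
for idx in (3, 41, 77):
    outputs[idx] = 0.95

code = column_votes(outputs)
active = [i for i, bit in enumerate(code) if bit]
```

The resulting vector has 3 ones out of 100 entries, so the pattern of active columns, rather than any single column, identifies the object.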

B

Other questions.

C

Well, actually, sorry, I have a follow-up on that: how would you backpropagate with that?

A

So I think you can still train each one separately. Each cortical column will be trained on the set of objects that it knows, and it will treat the other objects as negative examples for the Bernoulli variables. So you train each column in isolation, using that approach.

A

But yeah, this is just off the top of our heads.

B

Anything else you guys want to discuss?

A

I think we got a lot of great input already, so thank you for that. Let's keep in touch, and maybe we can collaborate on this, or send over a student for an internship or something.

F

Yeah, that is something we've done a lot in the past, pre-COVID. We could do it now too, and we're really open to that. We like that kind of thing, having someone sit here for weeks or something. It's useful.

A

Yeah, that would be relaxing.

F

So we're certainly open to that, and we've been posting. Again, I want to thank you guys for doing this. I think it's really fascinating to see how you take some of these ideas that we've come up with from neuroscience and apply them to machine learning. As I said, it's stuff that we're trying to figure out how to do too, so you're ahead of us. So that's good.

F

We appreciate it very much.