"Sparse Distributed Representations: Our Brain's Data Structure"
Subutai Ahmad, VP Research, Numenta
Numenta Workshop, October 2014, Redwood City, CA
A: Okay, so I'm going to start by talking a little bit about sparse distributed representations. But before we get started, I do want to warn you that this talk is quite different from Jeff's talk. Jeff gave a very broad overview of the theory; I'm going to go fairly deep into one particular area, and just fair warning, there is going to be a little bit of math in there.
Now, you don't really need to understand the math to understand the gist of it, but we felt it's important to go through some of it, because we're trying to come up with rigorous ways of understanding the behavior of cortical systems. So, just fair warning, there is going to be a little bit of math. I know for some of you that's not really your thing, and if you want to take this opportunity to go outside and make math jokes, I will not be upset.
Okay, all right. Our goal at Numenta is to try to understand the computational principles of the cortex and then to build intelligent systems based on those principles, and when we do that it's sometimes useful to look at things from a computer science perspective.
So what I'm going to try to do is share with you some of the progress we've made in understanding one particular part of it: how information is represented through sparse distributed representations, and how it's used throughout the system. Okay, so before I really get into the details, I want to discuss a little bit about the role of sparse distributed representations. Where are they used? How are they used?
A
Well,
it
turns
out
they're
pretty
much
used
everywhere
in
cortex
and
just
to
give
you
a
flavor
flavor
for
that,
I'm
going
to
walk
you
through
one
particular
one
simple
example.
So,
let's
say
you're
playing
a
musical
instrument.
Okay,
as
you're
playing
your
auditory
system,
is
listening
to
the
musical
notes
and
in
your
auditory,
cortex.
A
small
percentage
of
the
neurons
at
any
point
in
time
are
responding
to
the
particular
frequencies
that
are
being
played.
They're
highly
tuned
to
specific
frequencies,
the
rest
of
the
neurons
at
any
point
in
time
are
silent.
Similarly, in the visual system there are neurons that are responding to specific spatial frequency patterns and colors and so on, and a small percentage of the neurons in the visual system are active at any point in time. There's a sparse pattern of activity that's representing the visual scene, and indeed all of your sensory areas are representing the sensory information that's coming in at any point in time as multiple sparse distributed representations.
Those areas are influenced by the top-down SDRs that are coming from higher levels, and each neuron is deciding whether or not it fits into the plan at this point in time and executing a series of instructions. So you have a sequence of SDRs that's generating, let's say, your finger positions on a violin, or head movements, or whatever it might be, and similarly your brain is making predictions about what sounds it might hear. Those are represented as SDRs.
Your attention system is using SDRs to decide what areas to pay attention to and what to ignore, and really, SDRs are the foundation for all cognitive functions across all sensory modalities. So we think of it as the brain's common data structure, and what we would like to do is try to understand this data structure and analyze it, and maybe this can help provide a more rigorous foundation for understanding cortical computing. If we can really analyze it, we can understand the behavior of the system.
Okay, so here's an outline. There are really just two parts to the talk: I'm going to spend a little bit of time on the basics of SDRs, sparse distributed representations, and then I'm going to go through a list of fundamental properties that we can derive based on what we know from the neuroscience and from the math. I'm going to talk about things like error bounds and scaling laws and so on. Okay, so let me introduce you to an SDR. This video loop was created in Professor Hassan's lab.
What it shows is basically a patch of mouse cortex, and each individual light there is a single neuron that's firing at a particular point in time. What's really amazing about this video is that it was recorded while the mouse was performing complex cognitive tasks. So this is not from an anesthetized animal; it's from an awake animal performing a complex task. Your brain right now, as you're listening to this speech, if you were to poke inside it, looks like this: these flashing lights.
Okay, so let me go through and list some basic attributes of SDRs, what we know from the neuroscience. The most basic thing is that at any point in time there's just a small number of neurons that are firing. There are just a few of these bright lights, and at the same time there's a lot of black space here. There's actually a very large number of neurons that could potentially be firing at any point in time, but only a small percentage are firing.
A
Every
cell
here
represents
something
and
has
some
semantic
meaning.
We
know
that
neurons
in
cortex
tend
to
be
fairly
highly
tuned
to
specific
patterns,
and
so
every
point
in
there
actually
has
some
meaning
it's
not
some
random
bit.
That's
coming
on,
and
at
the
same
time
no
no
cell
is
critical.
You
can
destroy
a
good
percentage
of
these
cells
in
the
system
will
work
just
fine
and
in
fact
the
information
is
distributed
across
the
cells
you,
you
might
have
a
cell.
A
That's
focused
on
a
particular
orientation
of
an
edge
and
you'll
have
other
cells
that
overlap
with
it
that
respond,
maybe
not
as
tightly
to
that
particular
orientation,
but
to
some
other
orientation.
So
the
information
in
cells
is
distributed
and
no
single
cell
is
critical.
So
that's
a
very
important
property
of
sdrs.
We also know from neuroscience that whatever the data structure is, it has to enable extremely fast computation. The cortex can recognize amazingly complex objects in a very small number of steps, sometimes as few as 20 or 25 steps. It can recognize faces and animals and so on. So whatever the data structure is, there's not that much room for a lot of iteration.
It has to work very fast, so the data structure has to enable very efficient computation. The last property is that SDRs are by and large binary. Now, there are some cases where you see perhaps some non-binary activation, but what we found is that there's so much room in SDRs that we can pretty much represent anything we need to with a binary code, and for the rest of this talk I'm going to assume that everything is binary.
Also, in the rest of the talk I'm going to represent SDRs not as these black squares but as a binary vector, where the positions represent individual cells. If a number is zero, that means the cell is not active, and if the number is one, that means the cell is active. So we're going to look at binary vectors that look like this.
Okay, let's look at a single neuron. How does a neuron operate on SDRs? Well, every cortical neuron gets a number of different SDRs as input, and Jeff went through some of this in his talk. We have SDRs that are coming from above, feedback SDRs; we have context SDRs, whether it's temporal context or other context; and we have bottom-up sensory SDRs.
So each neuron is getting a bunch of these SDRs coming into it, and at the end of the day each neuron then represents one bit in some output SDR that the rest of the system is going to see. Okay, so it gets a bunch of different input SDRs and it's going to output one bit in this vector. Let's look at this in a little more detail. On the left you have the pyramidal cell, and on the right we have our model neuron; again, Jeff went into this in some detail. I'm going to look at just two aspects of it. First, we have the distal dendrites. They are getting the feedback SDR and the context SDR, and in our model neuron they're represented with those blue synapses there, and we tend to have on the order of 100 to 200 of these distal dendritic segments.
Now, in the brain each of these segments is actually fairly independent of the others, and each segment is detecting a particular unique SDR using a threshold operation. Those blue dots represent individual synapses, individual connections to those SDRs. If enough of them are on, then that segment will say: hey, I've detected this SDR. But each of these segments is operating independently, with a very simple kind of threshold computation.
The second thing I want to focus on is the proximal dendrites. Those typically get bottom-up sensory SDRs, and they're represented with the green dots there. They also represent multiple patterns, but in a very different way: the proximal segments represent dozens of separate patterns in a single segment. So here we have a bunch of SDRs that are kind of smooshed together into one segment, and somehow it's able to recognize each one independently. I'll go into these two basic types of operations in more depth later.

Again, in both cases each synapse here corresponds to one bit in some incoming high-dimensional SDR that is the input to the neuron, and then the neuron is going to output one bit in some output SDR. Okay, so what are some of the properties that we want to go over?
Let's discuss a little bit of notation here. I'm going to represent an SDR as a vector x with n binary values, where each bit represents the activity of a single neuron. So you have n different bits, and they're either going to be 0 or 1 depending on whether a cell is firing or not. s is going to be the percentage of on bits, and I'm going to use the letter w to denote the actual number of on bits in the representation.
So if you have a vector, w is simply the cardinality, the number of on bits in there. And here's an example: I have two different SDR vectors where n is 40, so there are 40 total elements in each one, with 10 percent sparsity, so about four bits are on at any point in time. That's a pretty small SDR vector. Typically in our implementation we use much larger numbers, and these correspond much more closely to the numbers you see in a layer in biology.
We typically use a value of n somewhere between 2,000 and 65,000, so these are pretty high-dimensional vectors. The sparsities we work with tend to be anywhere from about 0.05 percent, when n is really high, all the way up to about two percent, maybe four percent, and the value of w that we typically use is around 40.
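To make the notation concrete, here is a minimal Python sketch (illustrative only, not NuPIC code) of an SDR as a binary vector, using the n and w values just mentioned:

```python
# A minimal sketch, assuming n = 2048 and w = 40 as in the talk.
import numpy as np

n = 2048                      # total number of cells (bits)
w = 40                        # number of active (on) bits

rng = np.random.default_rng(0)
x = np.zeros(n, dtype=np.uint8)
x[rng.choice(n, size=w, replace=False)] = 1   # a random SDR with w on bits

s = w / n                     # sparsity: fraction of on bits
print(f"n={n}, w={int(x.sum())}, s={s:.2%}")  # s is about 1.95%
```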
Okay, and it turns out that there's a reason for these numbers, which comes out of the math. So first let's talk about capacity. This is fairly straightforward: the number of unique patterns that can be represented in a vector is simply n choose w, since there are w on bits out of a possible n bits. Now, this is a lot smaller than 2 to the n, which is what you would get if you had a dense representation, but it is far more than any reasonable need you might have.
So, for example, in the range that we're dealing with, let's say n of 2048 and w of 40, the number of unique patterns is actually 10 to the 84th or greater, which is way, way greater than the number of atoms in the universe. It's worth pointing this out, because people have been concerned that if you have a sparse representation, maybe you're losing something, but there's actually tremendous room in there to represent really rich concepts.
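As a quick sanity check on that capacity claim, a sketch using exact integer arithmetic:

```python
# Capacity of a sparse binary code: n choose w unique patterns.
import math

n, w = 2048, 40
capacity = math.comb(n, w)
print(f"{capacity:.2e}")   # about 2.4e+84, versus 2^n for a dense code
print(capacity > 10**84)   # True
```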
Okay. Similarly, if you took two random vectors, the chance that they're actually identical is basically zero; it's one over that number. So it's extremely unlikely that if you were to pick two random SDR vectors, they're going to be the same. Okay, so that's capacity. Next I'm going to talk in a little bit of detail about how well we can recognize patterns in the presence of noise, and I'm going to need to develop a few concepts along the way.
First of all, we're going to talk about similarity metrics. If you talk about recognizing patterns, you want to know when two patterns are similar to one another, and if they're similar enough, then you say that you recognize it. With SDRs we don't use typical vector similarities: neurons cannot compute Euclidean distance or Hamming distance or anything like that. That would actually require full connectivity between layers, and we just don't see that. The similarity metric we're going to use is called the overlap.
The overlap is simply the number of bits two vectors have in common. You can think about this as sort of the opposite of Hamming distance: Hamming distance asks how many bits are different, whereas here we're only concerned with the shared bits. And this requires very minimal connectivity. If you have a vector with 40 on bits, you only need to look for those 40 bits in any other target SDR; you don't care about the rest of the bits, so it can be very efficient.
Mathematically, you just take the AND of the two vectors and then compute the length; that gives you the overlap. We can also define a match: we say we detect a match between two vectors if they're close enough. Basically, if the overlap between two vectors meets some threshold theta, then we say they match.
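In code, the overlap and the match are each one line; a sketch (the helper names are mine, not an established API):

```python
# Overlap: count of shared on bits (AND the vectors, then count).
# Match: overlap at or above a threshold theta.
import numpy as np

def overlap(x, y):
    return int(np.count_nonzero(np.logical_and(x, y)))

def match(x, y, theta):
    return overlap(x, y) >= theta

rng = np.random.default_rng(0)
n, w = 2048, 40
x = np.zeros(n, dtype=np.uint8)
x[rng.choice(n, w, replace=False)] = 1

# Corrupt the pattern: turn 10 of its on bits off and 10 other bits on.
noisy = x.copy()
on, off = np.flatnonzero(x == 1), np.flatnonzero(x == 0)
noisy[rng.choice(on, 10, replace=False)] = 0
noisy[rng.choice(off, 10, replace=False)] = 1

print(overlap(x, noisy))           # 30: most of the original bits survive
print(match(x, noisy, theta=30))   # True: still recognized despite noise
```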
Okay, so how accurately can we match in the presence of noise? To kind of build this up, consider this diagram here. The circle shows the space of all possible SDR vectors, and we want to match some candidate vector against a specific set of stored vectors. Each of these dots is a particular vector that we want to match against, and we would like to match them in the presence of noise. In the case of SDRs, each bit has semantic meaning.
We know that we never get the same input twice, but as long as the input is similar, the bits are going to be shared and there's going to be a high overlap between them. So what we care about is how well two SDRs overlap. The way you can control that is by decreasing the match threshold theta, and as you decrease the threshold, you can see that the white space around each vector increases. This is the set of vectors that match, given that threshold.
So as you decrease the threshold, you become more and more robust to noise: you're going to allow more and more patterns to match against that candidate pattern. Of course, you don't get anything for free. As you do that, you also increase the chance of false positives: as the white area grows, there's a much higher chance that it's going to overlap with some other vector that is not coming from the source you're interested in. So we're interested in this trade-off.
What is the size of the white space versus the size of the gray space? It turns out you can actually calculate this, and we can do it using something called the overlap set: how many vectors match as you decrease the threshold. We define the overlap set of x to be the set of vectors with exactly b bits of overlap with x. And for a match, let's say you have a w of 40 and your threshold is 30.
It turns out the equation for that has two components. On the left-hand side, you have the number of subsets of x with exactly b bits on. So let's say you have 40 bits that are on, and you're asking how many ways there are to share exactly 33 of them; that's going to be 40 choose 33. The second component is the number of ways to place the remaining w minus b on bits in the rest of the representation, outside of x. So for each subset of exactly b shared bits, you're going to have a number of other patterns that put their remaining w minus b bits elsewhere. The product of the two gives you the total number of vectors that have exactly b bits of overlap with x:

    |overlap set(x, b)| = C(w, b) × C(n - w, w - b)

And then the error bound is simply going to be the ratio of the white space to the gray space.
So you look at all values of b, all the way from theta up to w, and you add them all up; that gives you the white space. Then you divide by the gray space, which is n choose w. So if you have a single stored pattern and you pick another pattern at random, the probability of getting a false positive is given by this equation:

    fp = (sum over b = theta to w of C(w, b) × C(n - w, w - b)) / C(n, w)
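Plugging the formula in directly, a sketch of the single-pattern false-positive bound:

```python
# False-positive probability for one stored pattern: the fraction of all
# n-choose-w vectors whose overlap with x is at least theta.
from math import comb

def fp_rate(n, w, theta):
    white = sum(comb(w, b) * comb(n - w, w - b) for b in range(theta, w + 1))
    return white / comb(n, w)

print(fp_rate(2048, 40, 30))   # vanishingly small in the large regime
print(fp_rate(64, 12, 8))      # many orders of magnitude worse for small n
```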
Okay. Now, if you have M stored patterns, you can get a pretty tight upper bound just by adding all of this up. If you look at the union of all of the white space in the diagram, that gives you the total number of possible false positives across all the stored vectors, and then you divide again by n choose w, and that gives you the probability of a false positive.
This equation is very hard to get an intuitive understanding for; there are factorials and exponentials in here. So what does this actually mean in practice? We can plug in a bunch of numbers, but essentially it turns out that with SDRs you can classify a huge number of patterns with substantial noise in them, as long as n and w are large enough.
So, for example, if you have n of 2048 and w of 40, it turns out you can have up to 33 percent noise, up to 14 bits of noise, and you can actually classify a quadrillion patterns with an error rate of less than 10 to the minus 24. This is basically insane, right? That's a thousand trillion patterns, with an extremely high amount of noise, and you can classify them with a very, very low probability of any sort of error. And this is sort of the beauty of SDRs.
This would be very difficult to do with a dense representation. Essentially, what's happening with an SDR is that it's changing the representation in such a way that there's a tremendous amount of room, and you can create a very simple recognition system and recognize a very large number of patterns very, very robustly.
You can actually do even better. You can get up to 50 percent noise and the error rate is still extremely good; in this case it's about one in a hundred billion. So you can again classify a very large number of patterns, with a lot of noise, at a very low error rate. Now, it turns out, again from the math, that this only works if n and w are both large enough.
As an example, if you take n of 64 and w of 12 and the same percentage of noise, which is about four bits in this case, you can only classify about 10 patterns, and the error rate is 0.04. You're in a dramatically different regime when you have these small numbers, so you really need the numbers to be large enough to get into this really nice regime where you can classify things extremely robustly.
Okay, we can also learn a little bit about neurons from this, and it turns out that neurons are actually extremely robust pattern recognizers. We've talked about the distal dendritic segments: they essentially do this match operation. Each segment is looking at an overlap with the input SDR, and if the overlap is above some threshold, the segment says it's detected that pattern. So from the math we can understand that neurons are extremely robust pattern recognition systems.
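Putting those pieces together, a sketch of a dendritic segment as a threshold detector (the class and the numbers here are illustrative, not NuPIC's API):

```python
# A distal segment samples a small set of synapses (bit positions) from a
# pattern and fires when enough of them are active: the match operation.
import numpy as np

class Segment:
    def __init__(self, synapses, theta):
        self.synapses = np.asarray(synapses)  # indices this segment watches
        self.theta = theta                    # match threshold

    def detects(self, sdr):
        # Overlap restricted to the stored synapses, so the cost depends
        # on the number of synapses, not on the full vector length n.
        return int(sdr[self.synapses].sum()) >= self.theta

rng = np.random.default_rng(1)
n, w = 2048, 40
x = np.zeros(n, dtype=np.uint8)
x[rng.choice(n, w, replace=False)] = 1

# Subsample about 20 of the pattern's on bits, on the order of what a
# real segment has, and require 15 of them to be active.
seg = Segment(rng.choice(np.flatnonzero(x), 20, replace=False), theta=15)
print(seg.detects(x))                        # True
print(seg.detects(np.zeros(n, np.uint8)))    # False
```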
It's also interesting that the numbers in the math actually correspond really nicely to the numbers we see in neuroscience. The math says w has to be about a dozen to two dozen before you get into this nice range, and it turns out that on distal dendritic segments you see about a dozen to a few dozen synapses per segment. So there's a very nice correspondence between the theory and what we see in biology, and I think there's a good rationale for that.
Again, you need a high enough w in order to get high accuracy. You can also have tens of thousands of these neurons arranged in a network, looking at the same input SDR, such as the feedback SDR or the context SDR, whatever it might be, and very robustly picking up very subtle patterns in there. Again, the capacity for doing this is huge, and you have extremely robust recognition as long as you're in the right regime of numbers.
Okay, so SDRs give us the capability to recognize a very large number of patterns with very high noise. Let's also talk a little bit about random deletions; it turns out this is actually very similar to the previous case. SDRs are very robust to random deletions, and we know in cortex that bits in an SDR can just disappear: individual synapses are very unreliable.
This is actually a great property for those building HTM hardware, and the nice thing about this analysis is that we can actually characterize the exact probability of failures given the system design parameters. So for those building HTM hardware systems, this is a really useful property.
A
So
there
are
a
bunch
of
situations
where
we
want
to
store
multiple
patterns
within
a
single
sdr
and
then
later
match
them
against
a
candidate
sdr,
and
you
know,
jeff
talked
about
one
example
of
this,
which
is
looking
at
the
proximal
dendrite.
There's
another
example
which
is
in
in
temporal
inference.
The
system
has
to
make
multiple
predictions
about
what's
going
on
in
the
future,
but
those
predictions
are
represented
as
the
predictive
state
in
in
cells
and
there's.
We just have a fixed set of cells, and so somehow you have to be able to represent multiple predictions about the future, at any point in time, in this fixed vector. And the set of candidate predictions changes at every time step, so it has to be a very dynamic thing. The brain doesn't have the capability of allocating memory on the fly like we can in software; it has a fixed structure, and you have to be able to represent this very dynamic property within that fixed structure.
It turns out you can do that with SDRs, up to some limit. We can store a set of patterns in a single fixed representation just by taking the OR of all of the individual patterns. Here's an example, the same example that Jeff walked through earlier: suppose you have ten different vectors, each with two percent sparsity.
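A sketch of that union operation, with the same assumed sizes as before:

```python
# The union of a set of SDRs is their bitwise OR; membership is then
# tested with the same overlap operation as before.
import numpy as np

rng = np.random.default_rng(2)
n, w = 2048, 40                    # 40/2048 is about 2% sparsity
union = np.zeros(n, dtype=np.uint8)
patterns = []
for _ in range(10):
    p = np.zeros(n, dtype=np.uint8)
    p[rng.choice(n, w, replace=False)] = 1
    patterns.append(p)
    union |= p                     # OR the pattern into the union

print(int(union.sum()))            # a bit under 400: some bits collide
# Every stored pattern still matches the union perfectly:
print(all(int((union & p).sum()) == w for p in patterns))   # True
```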
Okay, so of course the more vectors you OR together, the more on bits you're going to have, and there's going to be this trade-off again: the vector representing the union is also going to match a large number of other patterns that were not in the original set. So where does this break down? How many such patterns can we store reliably without a high chance of false positives?
This calculation is exactly the same as with Bloom filters, if you're familiar with them. You now have an expected number of on bits in the union vector, and you can plug that into the previous equation we had and calculate exactly the chance of a false positive. Without going into the details again, what does this mean in practice?
It turns out that if you have n equals 2048 and a w of 40, you can actually take the union of 50 patterns and have about a one in a billion chance of false positives. This is not intuitive. Each pattern here has two percent sparsity, but you can actually OR together 50 of them. Intuitively you might think: well, about 40 percent of the bits are still off, and in order for another random vector to match, all of the on bits in that other vector have to fall within the 60 percent that are on, and the chance of that is actually very low. That's the intuition behind why this error is as low as it is. But again, you need a large enough n and a large enough w to do this: if you have an n of 512 and a w of 10, you get a much, much higher error rate.
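The Bloom-filter-style calculation can be sketched as follows; the expected-density formula and the reuse of the earlier false-positive bound are the assumptions here:

```python
# After OR-ing M random patterns, the expected fraction of on bits is
# 1 - (1 - w/n)^M. Treat the union as a vector with that many on bits
# and reuse the single-pattern false-positive bound from before.
from math import comb

def union_fp(n, w, M, theta):
    w_union = round(n * (1 - (1 - w / n) ** M))  # expected on bits in union
    white = sum(comb(w_union, b) * comb(n - w_union, w - b)
                for b in range(theta, w + 1))
    return white / comb(n, w)

print(union_fp(2048, 40, 50, theta=40))  # on the order of 1e-9, as claimed
print(union_fp(512, 10, 50, theta=10))   # roughly 1e-2: a different regime
```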
Okay, so those are the main properties I wanted to talk about. The last thing I want to touch on is the efficiency of the system, and it turns out that SDRs enable extremely efficient operations. If you assume that each neuron can fire at most once every five milliseconds or so, then there are something like 20 to 30 computational steps available to recognize a face or a person or an animal or anything like that. So the number of operations has to be extremely small and fast; there's no time for loops or optimization steps and so on.
So even though SDR vectors are large, all of the operations I've talked about are actually O(w): they depend on the number of on bits, not on the underlying size of the vector. This would not be true if you were using Euclidean distance or Hamming distance; in those cases you need to look at the entire vector. But with the overlap operation, you just need to look at a small number of on bits, and that can be done extremely efficiently.
Similarly, matching a pattern against a dynamic list like the union is also O(w). You only care about the number of on bits, and in this case it's not the number of on bits in the union vector, not the sixty percent of bits that are on, but just the number of on bits in the vector that you're testing. So if your vector has 40 on bits and your union vector has a thousand on bits, it doesn't matter.
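A sketch of why this is O(w): store only the indices of the on bits, and a match against even a very dense union is just w set lookups:

```python
# Store SDRs as sets of on-bit indices; overlap cost scales with w, not n.
x_on = {4, 77, 902, 1500}           # tiny example: the on bits of x
union_on = set(range(0, 2048, 2))   # a large, dense union vector

shared = sum(1 for i in x_on if i in union_on)   # len(x_on) lookups
print(shared)                       # 3 (only 77 is missing from the union)
```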
All you need to do is check those 40 bits, so you can do this again extremely fast, and this, I think, is key. This is really what enables those tiny dendritic segments to become really robust pattern recognition systems. There's not that much room for computation in those segments, and so, because of the nature of the overlap operation and the nature of SDRs, you can actually get extremely robust recognition with very, very little computation. And we exploit this in our software as well.
All of our software is written so that it really depends on the number of on bits, and on my laptop, an eight-core laptop, you can easily simulate something like 200,000 neurons at 25 to 50 hertz. So thanks to these operations, they're amenable to extremely fast computation.
Okay, so that's the main stuff I wanted to talk about. Just in summary: SDRs are the common data structure in the cortex, and SDRs enable very flexible recognition systems that have very high capacity and are robust to a very large amount of noise.
The union property allows a fixed representation to encode a dynamically changing set of patterns, again in a very robust way. And although we're just getting started with this kind of analysis, the hope is that it will provide a principled foundation for characterizing the behavior of HTM learning systems, and then perhaps all cognitive function as well. We think that this is kind of the basis.
And lastly, I didn't really mention other work in the talk per se, but over the last 15 to 20 years there's been a fair amount of work on understanding sparse codes and sparse representations. I'm just picking out three bodies of work here that have been influential to us: Kanerva's work on sparse distributed memory, Bruno Olshausen's work on sparse coding, and the math behind Bloom filters, which is actually very relevant to the stuff we're doing. Okay, thank you.
C: Hi, I have a question: what kind of data can you represent in this data structure, and is there any data type that is really hard to represent in this data structure?
A: SDRs are extremely good at representing sensory patterns and the types of patterns that underlie cognitive operations. However, it would not be my first choice for representing some of the stuff we typically represent in computer software. I wouldn't use it to represent a database; you wouldn't want it to be the primary representation for Unicode, or SQL databases, or documents.
If you want very precise representations where you can't tolerate noise, you can do it with SDRs, but they're not going to be as efficient; something like ASCII or normal dense representations is going to be much better for that. But by and large, any sort of information or data that we typically use in cognitive processing can be represented very well with SDRs.
D: Yeah, I have a question about something that made me a bit confused. You said that we have structure in connectivity, right? We don't connect every neuron to every other neuron. (That's right, yeah.) So how is this represented? How is it learned? And also, all the math you showed assumes randomness, right? So if you have structure in connectivity, that doesn't hold anymore, right?
A: The way this happens is that each segment connects to a particular set of neurons, but there are a bunch of other potential connections that are kind of nearby. Jeff talked about the growth of synapses, and it turns out that if an axon is firing and it's correlated with a nearby cell that's firing, you will eventually grow a synapse. So there's a bunch of potential connections there, but it is not anywhere near full connectivity.
I sort of ignored spatial layout and topography and so on, and we can talk about that too, but the same principles will apply in those situations. You know, I mentioned that SDRs can work with random deletions. What I didn't talk about is that you can actually subsample the input and get very robust recognition as well; it's really the same thing, and that's another way you can avoid having full connectivity.
Okay, I think the second question was that I assumed everything is random, with no structure. That's right: this analysis did assume that everything is uniformly random, and there are sort of two answers to that. One is that in learning theory this is handled by using something like a PAC formalism, probably approximately correct. There you incorporate the distribution of the patterns into your analysis, and I think the results will come out very similar if you do that.
The other side of the coin is that one way to think about it is that maybe it's the job of the learning rule to come up with representations that actually randomize the data and make things more independent.
E: How do you model time here? I guess it's that union of vectors: each vector will be a sensory vector at a certain time, and if this is true, I guess there will be a kind of window of time which will be considered through that union of a number of vectors.
A: This was a fairly abstract analysis dealing with SDRs, but it applies very well to the temporal memory structures that we use. In that case, and Jeff alluded to this a little bit, you have cells in a column, and the sparse set of activity in the cells represents the temporal context. When you have patterns that follow that temporal context, they're represented by other cells, and the dendritic segments in those cells are recognizing the previous temporal context.
So what this analysis tells you is the bounds: how many temporal contexts can you recognize, and how much noise can you have in there, before your sequence mechanism starts to break down? The analysis gives you a lot of insight into that.
A: I think Jeff has a comment.

F: I just have a comment on that. I might have heard the question differently, I might have not, but I think the question was asking whether this union was of different SDRs at different points in time. They're not. If we do a union of predictions, these are all simultaneous predictions.
We're not using the union property to represent time; it's a spatial union property. So, for example, as you're listening to my speech, your brain is making many, many simultaneous predictions about what I might say next, the attributes of what I might say next, and you'll know if you hear something that's unexpected, even when you can't make one specific prediction. But anyway, that union is at a point in time. We're not using the union to represent time itself; that is in the transition states, in the memory system I talked about.
E: [Inaudible audience question.]
A: Yeah, in NuPIC it's a fairly straightforward thing. We have a process for taking arbitrary data and encoding it into an SDR, and then we feed it into the HTM system. At the end of the day, we have something we call a classifier that maintains a mapping of SDRs to actual values, and we go through the classifier to recover the actual value, let's say the predicted value, whatever it is.
So it's sort of handled outside of the cortical theory. We have a process for encoding things into SDRs, and then we have a process for decoding them, in different contexts, back to the original value. But internal to the HTM, no matter how many regions or layers you have, the language is all SDRs; the encoding and decoding is just for interfacing with the rest of the world.
Yeah, and in general it's not a one-to-one mapping, so part of what the classifier does is give you back the most likely value. There's a document on exactly how we do that. Anything else?
G: So you have these big vectors and so on, and they generate others. Presumably things aren't recorded in the brain at any one time; it takes many repetitions. So you have to kind of model the fact that things are remembered after many repetitions of the same vector. Is that the proper view? Because there's short-term memory and long-term memory in the brain.
A: Yeah, this is exactly the job of the learning algorithm. When I showed you those synapses, it takes multiple repetitions before those synapses are connected, so that's one way that learning happens. We also have this dynamic state, which is the predictive state of the system, so you can represent information dynamically and carry information through a sequence without it becoming permanent.
F: Is it on? There we go. I'll just add a little bit to that. In the HTM theory we have a learning rate, which is applied to a sort of Hebbian process for the growth of the synapses, and we can adjust that rate so that it takes three or four iterations, or whatever you want, for a synapse to become useful. We can also make it learn very rapidly; we can learn in one presentation.
We just make sure that the synapse gets over its threshold in one step. There's some equivalent to this in the brain: there are different modulators in brains which make you learn quickly or not learn, and that's well understood. For example, dopamine is something that makes you lay down memories much faster. So the real learning algorithms in biology are much more complex, with an infusion of various different neuromodulators. We have a fairly simple version of it, but it could be more complex if you wanted it to be.