From YouTube: SDR Classifier
Description
Yuwei at Numenta describes the SDR Classifier and lots of discussion ensues. This was recorded at the Numenta office at an internal meeting in the summer of 2016.
So this is a technical presentation on the SDR classifier. I will first describe the problem of classification and prediction with HTM, then talk about the SDR classifier, which is a single-layer feedforward classification network, and how we learn the weights in the network with online learning algorithms. There will be some algorithmic description of the classifier, and some comparisons with the older CLA classifier.

So for a typical use of HTM, we have some streaming data feeding into encoders, and that goes into the HTM model, and then we can do a bunch of useful things with it: anomaly detection, prediction, or classification. In this talk I will focus on the classifier part, and on how we use the high-dimensional sparse representations in HTM for prediction and classification tasks.
So currently there are three classifiers in NuPIC. The first one is a k-nearest-neighbor classifier, which is typically used for categorical classification. It maintains a set of templates stored in memory, but it does not evaluate the full predictive distribution; it just gives you the best match. It's very simple, but it may not work very well on online prediction tasks. The CLA classifier is the one we have been using.
So here is the setup of our classification problem. The goal is to map a sequence of high-dimensional SDRs, labeled as x here and changing over time, to a distribution over a set of K classes. This is the output of the classifier, which is also changing over time. This predictive distribution should sum to one at any time point, and the goal is to have a high prediction probability for the true class label, which comes from the training data; for a typical prediction task that is maybe five steps ahead in time.
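Roughly formalized (my notation, not from the slides), the setup is:

```latex
% x_t: the high-dimensional binary SDR at time t; K: number of classes;
% k: prediction horizon (e.g. 5 steps); z_{t+k}: the true class label k steps ahead.
\[
x_t \in \{0,1\}^{n} \;\longmapsto\; \hat{p}_t = \bigl(\hat{p}_t(1), \ldots, \hat{p}_t(K)\bigr),
\qquad \sum_{j=1}^{K} \hat{p}_t(j) = 1,
\]
\[
\text{with the goal that } \hat{p}_t\bigl(z_{t+k}\bigr) \text{ is as high as possible.}
\]
```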
You're not actually classifying this sequence, you're really classifying the SDR, the single SDR at that point, not the sequence? Yeah. Okay, so the idea is you take whatever the current state is, and you're saying that state was derived from a series like this, yeah.
Maybe I can clarify with a sketch. So this is the current state of the HTM; imagine it is a high-dimensional SDR, a big vector here. I want to map it to a set of classes. Say I have three classes, and I want to know the probability that the current input lies in each class. So here the probability distribution sums to one, and along the target class we want a high prediction probability. So the input basically encodes the current state, and this is the output of my classifier.
So a single-layer feedforward classification network is just like that. It's linear: each unit here first takes a weighted summation of all the inputs, so the weight matrix W is the only parameter, that is, those are the parameters of the model. And because the output is a probability distribution, there is an additional non-linearity, called the softmax, to make sure that the prediction probabilities sum to one. It basically takes the exponential of each unit's input and then divides by the sum over all units, so it's a normalization.
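As a minimal sketch of that inference step (variable names and shapes are mine, not NuPIC's actual SDRClassifier API):

```python
import numpy as np

def softmax_inference(weights, active_bits):
    """Class probabilities for one binary SDR.

    weights     -- (n_inputs, n_classes) connection weight matrix
    active_bits -- indices of the bits that are 1 in the input SDR
    """
    # Weighted sum: with a binary, sparse input the dot product reduces to
    # summing the weight rows of the active bits only.
    activation = weights[active_bits].sum(axis=0)
    # Softmax non-linearity: exponentiate, then normalize so the result sums to 1.
    exp_act = np.exp(activation - activation.max())  # subtract max for numerical stability
    return exp_act / exp_act.sum()
```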
The softmax is a very well-known thing. And the question is: how should we learn those connection weights such that the prediction probabilities match the data? We use maximum likelihood estimation. The likelihood is basically a metric of how well you are modeling the data. Here we are trying to predict the data, that is, the true label, and the likelihood is simply the probability of observing the true data under the predicted distribution. So here is my model's predicted distribution, and I want to make sure that the true data is likely to occur under it.
Typically, people use what is known as the negative log-likelihood loss. It is simply the negative logarithm of the likelihood, and we use this trick because the logarithm is a monotonic function: maximizing the likelihood is equivalent to minimizing the negative log-likelihood. The way we do that is gradient descent on this loss function. Basically, you calculate the gradient of the loss function with respect to all the parameters in the model,
that is, the connection weight matrix. The full derivation is available in the document, but after the derivation it turns out to be very simple: the gradient is basically the difference between your model's actual output and the target output, times the input. So this is the connection weight from the i-th input to the j-th class, and you adjust it in proportion to this gradient. It's somewhat intuitive to see how this comes out; basically it is derived using the chain rule.
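In symbols (my notation; this is the standard softmax-regression gradient the talk refers to):

```latex
% y_j: predicted probability of class j; t_j: target (1 for the true class, else 0);
% x_i: the i-th input bit (0 or 1); \alpha: learning rate.
\[
a_j = \sum_i w_{ij}\, x_i, \qquad
y_j = \frac{e^{a_j}}{\sum_k e^{a_k}}, \qquad
L = -\log y_c \;\; (c = \text{true class}),
\]
\[
\frac{\partial L}{\partial w_{ij}} = (y_j - t_j)\, x_i
\qquad\Longrightarrow\qquad
w_{ij} \leftarrow w_{ij} - \alpha\, (y_j - t_j)\, x_i .
\]
```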
We consider a binary sparse vector as the input, and the target output is also zero or one. At any time point we have one target output, the one-hot encoding of the true class label. So here is the algorithmic description of the SDR classifier. There are basically three phases: initialization, inference, and learning. Initialization is simply to initialize the connection weight matrix W_ij to be zero everywhere. That implies that all classes occur with equal probability before learning.
This is obvious if you look at the activation function in the classification network: with all-zero weights before learning, you get the same probability for all classes. Inference is to calculate the model's predicted class probability for each input pattern x, using the same softmax equation. And learning involves adjusting the connection weights W_ij in proportion to the gradient; since we consider binary inputs here, x_i is either zero or one.
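Putting the three phases together, a minimal sketch could look like this (a simplification for illustration, not the actual nupic SDRClassifier, which also handles multiple prediction steps and bucket bookkeeping):

```python
import numpy as np

class TinySDRClassifier:
    def __init__(self, n_inputs, n_classes, lr=0.1):
        # Initialization: all-zero weights, so every class starts with equal probability.
        self.w = np.zeros((n_inputs, n_classes))
        self.lr = lr

    def infer(self, active_bits):
        # Inference: softmax over the weighted sum of the active input bits.
        a = self.w[active_bits].sum(axis=0)
        e = np.exp(a - a.max())
        return e / e.sum()

    def learn(self, active_bits, true_class):
        # Learning: gradient step (y - t) * x; x is binary and sparse, so only
        # the weight rows belonging to the active bits are ever updated.
        y = self.infer(active_bits)
        t = np.zeros_like(y)
        t[true_class] = 1.0
        self.w[active_bits] -= self.lr * (y - t)
```

In a streaming setting you would call `learn(active_bits, label)` at each time step, and `infer(active_bits)` gives the current predicted distribution.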
So basically, we only adjust the weights for the active inputs; we don't need to adjust all the other connections. At any time, only a very small fraction of the weight matrix is updated, because the input is sparse. Additionally, for scalar value prediction we keep a running average of the actual values that correspond to each class, the same as the old CLA classifier does, just to make the prediction a little bit more accurate.
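For the scalar-value side, the idea is roughly this (my sketch of the bookkeeping, not the exact NuPIC code):

```python
class BucketAverages:
    """Running average of the actual scalar values seen for each class/bucket,
    so a predicted class can be mapped back to a scalar prediction."""
    def __init__(self, n_classes):
        self.sums = [0.0] * n_classes
        self.counts = [0] * n_classes

    def update(self, bucket, actual_value):
        self.sums[bucket] += actual_value
        self.counts[bucket] += 1

    def value(self, bucket):
        return self.sums[bucket] / self.counts[bucket] if self.counts[bucket] else 0.0
```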
The time complexity of this algorithm is proportional to the number of active bits at any time point times the number of classes. This is very easy to see, because the summation here only involves the active bits. Time complexity here means how the algorithm scales with respect to the size of the input and the number of classes.
So compared to the CLA classifier, this is somewhat more expensive: the CLA classifier's time complexity only scales with the number of active bits, whereas here we additionally multiply by the number of classes, because we want to evaluate the full distribution at any time point.
The sparsity is very important, because it means we only need to update a very small fraction of the weights. A side benefit is that, because a lot of the weights are not tuned at any given time, in practice the classifier seems to be less prone to overfitting compared to networks that use dense input vectors. That's just an observation.
Finally, because HTM represents a union of multiple predictions, the SDR classifier also evaluates the full predicted distribution, so it reinforces correct predictions and also penalizes incorrect predictions. That second part is not in the CLA classifier, and that's the reason why the old CLA classifier occasionally gives you outliers: with its voting scheme, only the correct predictions get reinforced, and the incorrect ones are never penalized. And here is a simple experiment of classifying random SDRs.
Sorry, I don't understand. You say you have twenty labeled SDRs, yeah? What you're doing in a streaming scenario, I mean, is labeling the state of the HTM at any point in time, yeah? Are you streaming in a sequence and then classifying it, or are you classifying at each step?
Okay, it's not really that. The data itself is not sequence data at this point; it's just random SDRs. Yeah, and you're just updating at every data point, right? Yeah. If you had said you predict the label in an online learning fashion, that would have been the same thing, yes, and clearer for me. Okay.
I could do that, yeah. So this is training with noisy data. Here I'm showing you performance as a function of noise level. The SDRs have forty active bits out of 2,000, typical SDRs, so a noise level of forty would be completely random. As you can see, the new SDR classifier is still perfect, while the CLA classifier starts getting worse immediately as you add noise; that's the effect shown right here as the noise increases.
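To make the noise level concrete: with 40 active bits out of 2,000, a noise level of n means n of the active bits are moved to random other positions, so at n = 40 none of the original pattern is left. A rough illustration of that idea (my reconstruction, not the actual experiment code):

```python
import numpy as np

def add_noise(active_bits, noise_level, n_total=2000, rng=np.random):
    """Replace `noise_level` of the active bits with randomly chosen inactive bits."""
    active = list(active_bits)
    replace_idx = rng.choice(len(active), size=noise_level, replace=False)
    inactive = np.setdiff1d(np.arange(n_total), active)
    new_bits = rng.choice(inactive, size=noise_level, replace=False)
    for i, b in zip(replace_idx, new_bits):
        active[i] = int(b)
    return active
```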
The third experiment, shown in a different chart, is continual learning: I train until the performance is stable, then I switch to a different data set and see how quickly it adapts to the new data set. The SDR classifier takes about the same amount of time to get back to perfect performance,
whereas the CLA classifier somehow never recovers to its previous baseline. I think that's because a lot of the false predictions are still there and are never penalized, so you learn the new ones, but the old ones are still there. So basically your distribution gets less precise.
We also use the SDR classifier for the taxi passenger count prediction, which is in the Neural Computation paper. Again, the task is to predict future taxi demand. Here I'm using an encoder, sequence memory, and SDR classifier network, so it is actually classifying the states of the HTM; the last layer is the SDR classifier. If you use the traditional root mean square error metric, the SDR classifier does much better, and then you can also use a metric beyond the traditional ones, the negative log-likelihood.
Also, the prediction looks much cleaner. As you can see here, with the old classifier there are a lot of false predictions, and occasionally you even get a dramatic outlier here, which has a very big impact on the error. Can you explain what those red things are? Okay, so the black is the data, the true data we are trying to predict. The blue is the best prediction according to the classifier, and the red is the underlying predicted distribution of the data according to the classifier.
Well, there is something slightly different because of the sparsity of the patterns you are classifying. I'm just curious how it would react: in a typical HTM sequence memory you often have all these unions of states at once, and so if you try to classify that, how does this behave in that situation?