Description
Ares Fisher discusses dendrites in machine learning, and reviews the paper “Improved Expressivity Through Dendritic Neural Networks” by Wu et al.
Paper: https://proceedings.neurips.cc/paper/2018/file/e32c51ad39723ee92b285b362c916ca7-Paper.pdf
B: So briefly, on the idea of using dendrites in machine learning: the earliest reference I could find that was also cited by a lot of subsequent papers was this paper by Poirazi and Mel. What they did, basically, as this little figure shows, is that they could approximate the firing rate of a CA1 pyramidal neuron by modeling it as a two-layer neural network with a sigmoidal output — with sigmoidal activation functions. That was their best-fit model, basically, and they're —
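For readers, here is a minimal sketch of the two-layer idea being described — each dendritic subunit squashes its own synaptic sum through a sigmoid, and the soma applies a final sigmoid to the weighted subunit outputs. All weights and sizes are made-up illustrative values, not the fitted parameters from Poirazi and Mel's model.

```python
# Toy two-layer "pyramidal neuron" sketch (illustrative values only).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_subunits, syn_per_subunit = 8, 20                    # hypothetical counts
w_syn = rng.normal(size=(n_subunits, syn_per_subunit)) # layer 1: synapse -> subunit
w_sub = rng.normal(size=n_subunits)                    # layer 2: subunit -> soma

def firing_rate(x):
    """x: synaptic drive, shape (n_subunits, syn_per_subunit)."""
    subunit_out = sigmoid((w_syn * x).sum(axis=1))  # each subunit sums, then squashes
    return sigmoid(w_sub @ subunit_out)             # somatic sigmoid of subunit sum

print(firing_rate(rng.normal(size=(n_subunits, syn_per_subunit))))
```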
B: Like, "oh look" — so basically, the neuron is complex enough that each little dendritic segment is like a little neuron, basically a little artificial neuron. I didn't show the HTM neuron here, because it has little binary things and I thought that would confuse people. But basically, what people have been looking at since is this sort of separation between, like, apical and — they don't usually call it basal, but mainly like feed-forward dendrites that strongly influence the soma.
C: I mean, because in general — well, maybe they meant to be talking about dendritic spikes. But generally you can't get a cell to fire from activating, you know, a section of a — yeah.
C: It's not even firing rate, I guess — okay. So anyway, they were trying to show how these distal dendrites would affect firing rate, not firing, because that little picture where they're showing the sum — you know, the sum in this two-layer network — would imply that.
D: Even on the dendrites — because that was the whole premise behind the HTM sequence memory theory. So all of this was done on models, not actual recordings. Bartlett Mel actually had a paper earlier than this, and then Yiota and Bartlett published this, and this modeling work was really what led to the experimental work that came after them. So these guys actually predicted some of this beforehand, but this is all completely modeling work, as I remember.
C: We were just talking — she was just asking about those alphas. They show them as little dots, bigger and smaller or something; there's some sort of learned weight there. Rob is pointing out that there isn't a real-life parameter like that, so I don't think there's any — well, I would say there's nowhere... because somewhere, someplace, somebody said something, but in general that's not considered; there's no strong evidence that there's a learned weight of a dendritic branch like that. But the alphas are there, so they're made up.
B: So basically, inspired by this, I presume — because they cite it a lot — a number of papers have recently looked at using these separate segments, leveraging them in their models. So this is from a review by DeepMind: they really liked the idea that there are feedback projections going to apical dendrites and feed-forward prediction going to basal dendrites, and —
B: Sorry — for the apical activity, what they focus on is this plateau potential. So the little green thing here is the dendritic — the apical dendrite — voltage. If you have, like, a spike coming in with feed-forward activation, you'll get an attenuated voltage increase in the calcium active zone, basically over here as you go to the apical side. But if you have the feedback and feed-forward combination — basically, if you have activation of the apical dendrites — you get this plateau, and this plateau might have really interesting properties.
B: I mean, it does have really interesting properties. So what Guerguiev and Lillicrap proposed is that this could lead to a much cleaner, much nicer way of separating your error in these more biologically plausible neural network models. So they compare this with previous — excuse me — with previous models, where you'd have, you know, backpropagated — you'd —
B: You'd have feedback connections going from, you know, the output neurons to the hidden-layer neurons, and you'd have, like, an error pathway that separately activated them, and this adds model complexity, because you're adding a lot more parameters to the model. And, I mean, this also adds a lot of parameters to the model, but they don't think that one is biologically plausible, and they think this one is. They think that, basically, having feedback projections from, let's say, your output layer — those can go into the apical dendrite side and then trigger —
C: All these things are drawn really tiny here. So they're saying the feedback — the gray-black one — goes just to the apical one. It just took me a while to realize that that's what's behind it: it's not two layers of neurons; they're showing two dots per neuron, basically, and they're gray dots, right?
B: So they have, like, two separate pathways: a feed-forward pathway and this feedback pathway, as he said, and the feedback pathway goes onto the apical component. And they think this is more biologically plausible than the other solutions, which are having a separate error pathway, or having feedback weights that are symmetric to the feed-forward weights. But —
B: So they recently proposed the apical dendrite as a way to integrate feed-forward and feedback information. The rationale starts with: how do you integrate feedback information to create, like, a local learning rule? And their solution is: you create an apical dendrite layer and an apical dendrite segment.
B: So the other work, by Dr. Senn and others, has, you know, also leveraged the idea of apical dendrites, and they use a form of predictive coding. Basically — every diagram I found of that model was a little bit complex, so sorry, this is the simplest one I found, and it's still not super obvious — but basically what they do is: there's a feed-forward input that comes into the soma — assuming this comes into the basal dendrites, you can ignore this.
B: So basically, the feed-forward weights nudge the soma towards this reversal potential. And then, in the absence of any — imagine you exclude the dendrite for now — in the absence of any dendritic stuff happening, this is all the soma would do: it would try to go to this reversal potential. But the dendrite, with its plateau — and they assume a tight, strong coupling between the dendrite and the soma —
B: The dendrite has, you know, information about the target, and basically, for the soma to now go towards this — for the soma to go towards the reversal potential, which is like at zero activation — it needs to update its weights such that they would do that effectively. So what this really is, is a form of predictive coding, where they compare — you know, they subtract the dendritic from the somatic potential, or vice versa.
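A toy sketch of the flavor of local rule being described — the dendrite carries a prediction of the somatic potential, and its weights are updated from the local mismatch between the two. This is an illustration of the predictive-coding idea only, not the actual equations of the paper; the soma's potential here is a made-up stand-in.

```python
# Illustrative local "dendrite predicts soma" rule (not the paper's model).
import numpy as np

rng = np.random.default_rng(1)
n_in = 10
w_dend = rng.normal(scale=0.1, size=n_in)  # dendritic weights (learned locally)
lr = 0.05

for step in range(500):
    x = rng.normal(size=n_in)              # input seen by the dendrite
    u_soma = np.tanh(x.sum())              # stand-in for the somatic potential
    v_dend = w_dend @ x                    # dendritic prediction of the soma
    err = u_soma - v_dend                  # local prediction error
    w_dend += lr * err * x                 # delta-rule update from the mismatch

print("remaining mismatch:", abs(err))
```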
C: Can I come in with a question? — Yes, when you're done — yeah — oh, okay. Well, I have a basic question, and maybe it's just a more general neural network question and not specific to this thing: I want to understand how the target signal is represented. You have a layer of neurons, you have some output representation you want, and there's a lot of — large — connectivity between the layer of neurons and the output representation, and you've got this, you know, feedback. How is that — I mean —
C: I mean, don't we want to differentially modify each of those neurons in the hidden layer? And how does that differentiation occur? How is the target differentially affecting different neurons in the hidden layer? It's a basic question I don't understand about neural networks — I understand it in backprop, but I don't understand it here. Like that green arrow: how does its meaning differ between each of the hidden units, or doesn't it? What does it look like — what does the signal look like, the green arrow?
C: And then there's an assumption that each — the purple dendrite — has a weight associated with it, so how it affects the neuron is different. Is that right? — Sorry, can you repeat the last part? — I just don't understand: if I'm a neuron in the hidden layer and I'm getting this feedback on my apical dendrite, is it just a scalar value I'm getting, and is every neuron getting that same scalar? Is every neuron in the hidden layer getting the same scalar value?
B: They also get different sensory inputs, so some fraction of the neurons will be able to converge towards minimizing this error with this learning algorithm. And I don't know what they do beyond that — I've only read their implementation of how to do this learning locally in one neuron; I don't know whether they make, like, a big network out of this. Maybe some of you have read deeper into this.
B: So this is equivalent to backpropagation, in that you can actually differentiate it — you know, using this learning objective you can differentiate it easily and leverage the strengths of backpropagation and credit assignment. So this has been pretty interesting work. And then some of the work that Guerguiev and Lillicrap proposed, which I think is later — I think this is 2017, yeah, two thousand seventeen — basically, they —
B: They created this little diagram on the left, with the feed-forward and feedback separation, and the reason they wanted to do that was because they wanted to solve credit assignment in biologically plausible neural networks. Credit assignment is the idea that, given that there's an error in a desired output — you know, I want to reach a desired output and I want to learn: where do I put blame? On which synaptic weights, in which neurons, do I put blame for being, like, "hey —"
B: These Hebbian rules don't provide credit assignment. So they're like: how do we do credit assignment? Let's leverage this dendritic — you know, this apical dendrite — component, and look at the plateaus. They actually use, like, a spiking real-time model — and I mean the plateaus don't spike; sorry, the apical dendrites do spike in the model, but basically they create these plateaus, right. So this is like, you know, at some time —
B: — t, you know, the potential in the apical dendrite evolves, and then they average this in each phase of training. So there are two phases of training here — sorry, I could have ordered this in a much clearer way — so basically you have two phases of training. One is the feed-forward component, which is basically: they put the data in — let's say the image, the MNIST image — they feed it through the network, and this leads to a calcium plateau, right. And then they have the bit where the feed-forward —
B: So there's still feed-forward input, but they also include the feedback input, and this leads to a different calcium plateau. And the feedback — basically, what the feedback does is it creates the instruction signal. The instruction signal is like: this is the plateau you should have if you were perfectly representing —
B: — you know, if you were perfectly contributing to minimizing the error on this task. And basically, because of their setup, they can differentiate this difference between the target plateau and the feed-forward plateau, and so they update the weights, using backprop, to match it. Is everyone clear?
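A compressed sketch of the two-phase scheme just described. This is illustrative toy code only: Guerguiev et al. use a spiking real-time model, and all shapes, nonlinearities, and the learning rate here are made-up.

```python
# Toy two-phase "plateau" training step (illustration, not the paper's model).
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hid, n_out = 4, 5, 3
W = rng.normal(scale=0.5, size=(n_hid, n_in))   # input -> hidden (learned)
V = rng.normal(scale=0.5, size=(n_out, n_hid))  # hidden -> output (learned)
Y = rng.normal(scale=0.5, size=(n_hid, n_out))  # output -> apical feedback (fixed)

x = rng.normal(size=n_in)
target = np.eye(n_out)[0]                       # one-hot label

# Phase 1: feed-forward only; the apical plateau reflects the network's output.
h = np.tanh(W @ x)
out = np.tanh(V @ h)
plateau_fwd = np.tanh(Y @ out)

# Phase 2: feedback carries the target -- the "instruction signal" plateau.
plateau_tgt = np.tanh(Y @ target)

# Hidden weights are nudged by the plateau difference (local error signal).
lr = 0.1
W += lr * np.outer((plateau_tgt - plateau_fwd) * (1 - h**2), x)
```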
D: Maybe, as a bit of background — just to be clear, these have not been tried in machine learning. They start by saying backprop is the best you can do, and the goal here is to see: okay, do biological networks — does the neocortex — implement some form of backpropagation? Yeah.
D: Yeah — or maybe they're, you know, trying to understand the brain. This is a way for them to try to understand how the brain might do more complex learning, like backpropagation does. Backpropagation solves this credit assignment problem with deep — with hidden — layers, and they're saying, well, the brain has to solve the same problem. How could it possibly do that? Because pure Hebbian learning won't do it. So, these kinds of structures — they show that this basically approximates what backpropagation will do. Yeah.
D: Some of these papers are doing that so far, and the other direction is to say: okay, machine learning is wonderful, backpropagation is wonderful; the brain must be doing some form of backprop — solving a similar problem — so how could we have biologically plausible networks that solve the credit assignment problem with hidden layers? And that's what these papers are trying to show.
D: But they're not trying to get the machine learning community to use these; they're looking at techniques in machine learning and trying to get the neuroscience community to appreciate them more, saying: hey, this is a way you can understand the brain more. And, of course, they implement the model and they have it running in simulations, but the goal is to have biologically plausible models of backprop. Mm-hmm.
B: Yeah — and there is an implicit goal in this, which is saving things like computational power: if you have neuromorphic chips and analog, you know, low-power computation that you can do with spiking models — and spiking models have this problem with STDP, et cetera, where it's unclear... they don't perform as well. So if you can find some sort of architecture where you can do this with continuous models —
B: But I do find it — especially this model — pretty interesting, because it has some interesting parallels with biological data, even though I'm not sure if the neuron locally calculates a prediction error. But, you know, the fact that you have predictions — and I think Timothy Lillicrap is excited about this idea too — that the feedback connections are predictions, basically. Yeah.
D: No, that is the key thing: it's not directly available; you have to be able to do this in hidden layers. That is the key thing here — they're showing that these local circuits can do the same thing as backprop can do, which is to have good prediction-error signals in hidden layers. Yeah.
B: So basically, what it is, is just maxout with sparse weights. On the left is a diagram of a normal two-layer neural network: fully connected weights and some kind of nonlinearity here. Their network, instead, basically has these dendritic branches, and each one projects to one neuron, right — so each neuron can have any number of these branches. And the other thing they stress is that each branch in a given neuron will receive mutually exclusive input, so there will be no overlap in the input.
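A minimal NumPy sketch of the layer as described: each hidden neuron owns several "branches", each branch sees a mutually exclusive slice of the input, and the neuron's output is the max over its branches — i.e., maxout with structured sparse weights. Sizes are illustrative, and the contiguous input slices here are a simplification of the paper's random mutually exclusive assignment.

```python
# Toy "dendritic" layer: maxout over branches with non-overlapping inputs.
import numpy as np

rng = np.random.default_rng(3)
n_in, n_neurons, n_branches = 12, 4, 3
assert n_in % n_branches == 0
slice_len = n_in // n_branches                 # mutually exclusive input slices

# One weight vector per (neuron, branch), applied to that branch's slice only.
W = rng.normal(size=(n_neurons, n_branches, slice_len))
b = rng.normal(size=(n_neurons, n_branches))

def dendritic_layer(x):
    xs = x.reshape(n_branches, slice_len)       # partition the input
    pre = np.einsum('nbs,bs->nb', W, xs) + b    # each branch's activation
    return pre.max(axis=1)                      # max over branches, per neuron

print(dendritic_layer(rng.normal(size=n_in)))
```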
B: Yeah, it's a particular form of sparsity that they're also really excited about. It's not completely random sparsity — it's not like a random mask. They create an index for each branch, for each of these output weights here, and that's also how they propose to minimize the computational load of doing this: instead of multiplying by one big mask that is, like, number of input neurons times number of branches —
C: Not that it matters here, because this is just a neural network, but that's not the way real neurons work. In real life you wouldn't find a single cell making multiple synapses on a single dendritic segment, but you do find the same axon making multiple synapses on different segments of the same branch. So the exclusivity there isn't by branch; it's by segment of dendrite, and there are many more segments than there are branches. Mm-hmm. And so — I'm just pointing this out; it's not detracting from this model at all, because this model is trying to be a neural network — I'm just pointing out, for those who are interested, that you wouldn't see that kind of segregation. An axon — one of these colored lines from the input neurons — would actually make, often (not always, but often), multiple synapses on different segments of the same branch. You just don't see them making multiple synapses next to each other on the same segment. It just —
C: Same branch — the term "branch" refers to the major segments that attach to the soma, so a pyramidal neuron may have three or four branches, each of which has many segments. It's just a different level of parsing, and again, it's not important for their model — it's just a biological detail. That's all. Oh yeah —
C: And, you know, in our general model we said: oh, we have a soma, we have a proximal input, and then we have multiple segments, and each segment is like an independent little coincidence detector. Our model is not complete either, because the HTM neuron model doesn't try to make any distinction about the dynamics of an individual dendritic branch in terms of all the different places the branch splits.
C: Because the integration zone on a dendrite is not the branch; it's the segment. In the HTM neuron — in the neuron paper — the neuroscience says that the inputs to a dendrite are combined over a fairly short distance: typically they say 40 microns, roughly a thousandth and a half of an inch, and you can have up to, let's say, forty synapses in that distance. So there may be dozens of branches and thousands of synapses, but the integration zone — what would be the equivalent of a little neuron — is only a small section of a branch: a segment. So if you imagine having this one dendritic branch, and it forks out like a tree, and there are hundreds or thousands of synapses all over it: if you just activate several of those synapses scattered across it, they don't sum at all — they only sum —
C: I don't think so — well, I guess you could; maybe they call it that. Maybe that's for me to look at: they're saying, okay, these neurons have two segments, which is very limited. I sort of interpreted it as if they really meant a branch, but I guess you could look at it that way. You could say, for example — I don't have any actual neuron data here, but a real segment on a real neuron sort of maxes out at around forty-two synapses.
G: I have one question. You mentioned that if you look at where the branches diverge — what you're calling branches — the model of how those interact is complicated. Is there a simplifying way of looking at that? I understand the segments represent individual compartments, but when they converge — where the edges actually converge — do we know what happens?
C: Well — the junction is a different impedance match, if you will; a different impedance function at the junction. So if you try to push a voltage gradient through the junction, it gets lost; it doesn't work like it does up to the junction. You can model the dendrite as this leaky pipe, but at the junction all kinds of crap happens, so you see a lot of models about that.

However, our belief — well, our model, and there's evidence for this — is that once you have a dendritic spike, like an NMDA spike, it is able to travel through the junction; it doesn't get highly transformed. So if you're imagining inputs from different branches, or integrations getting combined downstream after going through these junctions — it doesn't look like that's easy, or that it happens; it's just weird.
C: Yeah, well, that's the way we model it, and my point is that some neuroscientists would disagree with that, because they study the complex dynamics of those Y-junctions and they say: well, it's really complex, because we did all these studies. But the idea is that it's not nearly as complex if you assume a dendritic spike. So we model the junction complexities as if they're non-existent. There's other evidence that suggests there are some things going on there.

So there's some evidence that if you start at the very end and activate a dendritic spike, and then activate another dendritic spike partway along, and then another one, the whole signal propagates better than if you do it backwards — so there's an order of preference. We don't model that; we haven't tried to accommodate that at all. So —
B: Right — thanks for that. So, realistic or not, this is their model, and like I said, basically the operation the dendrites are doing is maxout. Maxout is basically this: every neuron, instead of having some fixed deterministic nonlinear activation function such as a sigmoid or a ReLU or whatever, has a number of branches that come in, each with, you know, different input, and it selects the one that has the highest activation. So —
B: This is a little like a learned activation function — and there are people here who are a lot more expert at maxout and other learned activation functions, so please correct me if I say something wrong. What it can do is learn piecewise linear functions, and depending on the number of, quote-unquote, branches you get per output unit, it can become an increasingly complex piecewise linear function. So this is a generalization of the rectified linear unit to other piecewise linear functions. The rectifier is just the max of x or 0, right: if it's smaller than 0, you return 0. But it can end up doing something like the absolute value — for negative x you can get this guy here — so this would be one branch and this could be another branch, and having multiple branches could give you something like a quadratic function. Is this more or less correct?
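A quick check of that claim: maxout takes the max over several linear pieces, so with the right weights (hand-picked here, not learned) it reproduces ReLU and the absolute value, and more pieces give more complex piecewise-linear shapes.

```python
# Maxout as "max over linear pieces"; ReLU and |x| are special cases.
import numpy as np

def maxout(x, weights, biases):
    # One linear piece per (w, b); output is the pointwise max across pieces.
    return np.max([w * x + b for w, b in zip(weights, biases)], axis=0)

x = np.linspace(-2, 2, 9)
print(maxout(x, weights=[1, 0], biases=[0, 0]))           # ReLU: max(x, 0)
print(maxout(x, weights=[1, -1], biases=[0, 0]))          # |x|: max(x, -x)
print(maxout(x, weights=[-2, 0, 2], biases=[-1, 0, -1]))  # rough "bowl" shape
```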
B: Right, that's true, yeah. The difference here is that with k-winners, like in the max-pooling layer we have, there are sparse weights from whatever these are to the neurons, and then the top k of these get through (with k equal to 1). Here they have full connectivity from the branches to the neurons, but there are sparse weights into the branches. So that's a bit of a structural difference that would influence what the —
B: So they partially motivate the idea of having this sparsity — which they stress is different from other types of sparsity that have been tried — by the idea of model ensembling, where, if you combine random features or random components of a model, or something like dropout, you basically get better performance and, hopefully, better generalization.

But though this is similar to dropout, in their experiments they do not compare it to a maxout activation with dropout — and maxout was basically designed to be compatible with dropout; that was the motivation. So, performance-wise: in the paper itself they only show training loss, not test loss — you know, that's only in the supplementary part — but they basically make these graphs where they show the number of branches per neuron, for a given network size, a given layer size, and they compare their model —
B: — the dendritic neural network — with, this is like, layer normalization with a ReLU, or batch normalization with a ReLU, and this is the training loss, basically — so, effectively, how quickly the training loss drops — and they find there's a sweet spot, for this task at least. This is on Fashion-MNIST, and they did a few different experiments: CIFAR-10, CIFAR-100, and then the UCI data sets at the end — and they find, like, a sweet spot in the number of branches, right?
A: So when you plot accuracy, you're kind of thresholding — you're asking what the network's one guess at the class of the digit was; that's one thing. Training loss also captures confidence: if it was really confident in the right thing, it gets a lower loss, and if it got something wrong but assigned 50% probability to the correct answer, then it doesn't get penalized as much.
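A small numeric illustration of that point (illustrative values only): accuracy only checks the argmax, while cross-entropy loss also rewards confidence. Both predictions below are "correct" by accuracy, but the confident one gets a much lower loss.

```python
# Accuracy vs. cross-entropy loss: same argmax, very different losses.
import numpy as np

def cross_entropy(p, true_idx):
    return -np.log(p[true_idx])

confident = np.array([0.95, 0.03, 0.02])   # right answer, high confidence
hesitant  = np.array([0.50, 0.30, 0.20])   # right answer, low confidence
print(cross_entropy(confident, 0))  # ~0.05
print(cross_entropy(hesitant, 0))   # ~0.69 -- same accuracy, higher loss
```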
B: These are, like, layer normalization or batch normalization networks, and one thing you'll find interesting is that these are on the same tasks — these different networks are trained on the same tasks — but the dots are different in both plots. So I'm not sure what the comparison is supposed to be here, because if I were making this plot, I would hold these constant and just show the different activations alongside them for reference. Wait —
F: Why do you think they decided to plot the whole curves for this, rather than holding the number of branches fixed for those ones? Oh —
B: But their ordering is also different — it's not just the relative spacing, right? So it's a little bit strange, but yeah. These models — the diamond and triangle models — don't have a concept of branches per neuron; they're normal fully connected models with one hidden layer, and the number —
C: So earlier you said it was — was it 16 and 64 dendritic branches? But here they're starting at 64, so it's drawing out a higher number. Is that correct — like, the black line is the n = 64 one? ... No, that's not branches per neuron; the branches are plotted down below. Never mind, I've got it — this is the hidden... this is also the hidden layer size. Yeah, I've got it, right: the branch number is down below, and 16 is, like, a sweet spot. Yeah.
B: Instead of, like, max pooling, where you take the highest one or whatever — yeah, yeah. So they find this trend that performance increases with the number of branches per neuron — for maxout, yeah, and for ReLU plus average pooling, it increases with the number of branches per neuron; basically, I guess, you're adding more parameters. And there's a sweet spot for the branches per neuron in their model, and in maxout with these sparse incoming weights.
C: Anyway, according to this — if I interpret it correctly — this model only really does better than the existing models under the scenario of, like, 64 segments: those two lines on the bottom, the dotted blue and the dotted green. Those are the only two that sort of perform comparably. Is that the correct interpretation?
B: The biggest hidden layer, yeah, yeah. And they find that, depending on the branches per neuron, it approximates the performance of these other networks that don't have them. So they don't get better test accuracy or anything in any of their results — but the title is "Improved Expressivity", so that was something I needed to spend a little time looking up: expressivity, or expressiveness. So basically, what a neural network does is this, right?

Let's say you had a non-separable region, like the XOR problem, or these two overlapping curves here — you know, this thing is not linearly separable. What a hidden layer does is twist the space of the problem such that these become linearly separable, right. This post from Chris Olah's blog is really cool — and there was a paper in 2017 that sort of tried to quantify —
B: Oh — how many different stretches you can make, I think, is the idea; basically, how many of these lines you can create. So if on the input layer you have, let's say, one, two, three, four — four units — when you train it, these are the lines through this space (if it's a 2D space, like in this image); these are the lines it can separate along, right. So this one separates —

So for every unit that's active in the previous layer, it will drive a different permutation of these lines in the next layer. So this multiplies the, quote-unquote, expressiveness of the network, because you have a much higher space of separations, right. So for this guy's activation, let's say, one neuron can do this piecewise linear separation, and for a different activation, let's say, the same neuron does this one.
C: Yeah, you know, it's funny — I get that, but I've never understood it. With support vector machines, I said: okay, you've got all these divisions, right. But I always thought it'd be more like: the black lines go all the way across the figure, then the green lines go all the way across the figure, and then the next lines go all the way across. But they're not — each successive one is only drawing a line within one of the previous segments. Well —
C: And I guess I didn't understand there was a progression like this. With support vector machines it was introduced as: oh, we just divided up the space into a lot of, you know, planes, essentially, and then you get a bunch of components. But this is sort of saying — you're saying — there's a progression, a hierarchy of divisions going on here, in some sense.
G: So basically, what they're saying is: the reason why we're winning with the higher — I mean — well, that's one of the things we found with sparsity: with greater width we have, if you wish, more expressive power. So the notion of — when you had that representation where, you know, you can model various functions and you can get, you know, piecewise linear —

Basically, that's saying that with greater width you can have more and more complex, if you wish, separation planes, in some sense; and that has more ability to — in the next slide, where we're showing all those stretches — basically it means you can kind of thread the thing in between those distributions better. Is that kind of fair?
B: Yeah — but the measure of complexity here — this paper has a number of different measures, but one of them is basically this. Let's say you train the network and you present an input, so you go through a trajectory, and one of the measures is the number of piecewise — the number of linear — how do you call it — transitions.

The number of linear transitions per trajectory is a measure of expressiveness. So every time you switch, you know, in the input space, you're going to affect these activations here, and that will lead to different activations here, and then in the purple ones, so you're going to have more linear transitions. So, for example, if your trajectory — let's say you start here and you just go around, right — then at every t you present a different part of this input —
B: — and yeah, so that's what they call expressiveness. And they find that the width isn't actually a very big contributor to this; after some point — there might be a little bit of a trend at the beginning — what seems to be much more predictive of it is network depth. So, assuming nonlinear units in each layer, network depth very strongly predicts the number of transitions — so, quote-unquote, model expressiveness.
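A sketch of how one might count "linear transitions" along a trajectory — an illustration of the measure being described, not the paper's exact protocol. For a ReLU network, the on/off pattern of all units identifies the current linear region; each time that pattern changes as the input moves along a circle, we count one transition.

```python
# Count linear-region transitions of a ReLU net along a circular input path.
import numpy as np

rng = np.random.default_rng(4)
W1 = rng.normal(size=(32, 2)); b1 = rng.normal(size=32)   # first hidden layer
W2 = rng.normal(size=(32, 32)); b2 = rng.normal(size=32)  # second hidden layer

def activation_pattern(x):
    h1 = W1 @ x + b1
    h2 = W2 @ np.maximum(h1, 0) + b2
    return np.concatenate([h1 > 0, h2 > 0])   # which units are "on"

thetas = np.linspace(0, 2 * np.pi, 2000)
points = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)  # circular trajectory
patterns = [activation_pattern(p) for p in points]
transitions = sum(not np.array_equal(a, b) for a, b in zip(patterns, patterns[1:]))
print("linear transitions along the circle:", transitions)
```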
B: So in this regard, this paper outperforms an equivalent maxout or standard ReLU network, by counting these transition counts over a given trajectory. Now, I don't exactly know what the trajectory is — they showed how they did it, as in, you know, you have an input plus a delta — and they point out a theorem, and they prove — well, they demonstrate — that their architecture is also a universal approximator, like a standard one-hidden-layer network; but the transition counts for the same trajectory are higher in their network than in a maxout network with the same number of units in the hidden layer. Now, the thing they didn't do, as I mentioned before, is compare this with maxout that also had dropout. In the other networks they did batch normalization and other tricks that would improve the accuracy of the networks, but they did not do dropout.
B: For the ReLU one — I think FNN stands for fully connected neural network — and for the maxout network, yeah: so for the DNN they have a different branch number, and for the maxout network they increase the number of kernels in the maxout unit. Okay, yeah, that's the difference — the sparsity. So it's like a fully connected layer with maxout, and then for the ReLU — yeah, actually I'm not clear about this; they don't really mention what happened there, so I was curious about this too. I don't think it would be straight — unless this is an illusion caused by the fact that there are curves going outwards, right? Actually, I think it is actually straight, yeah; so I think this is supposed to be a set value, and it's just a visual aid for comparison. Okay — so does it look — does this green line —
B: I think — yeah, I think it might be an illusion; that's kind of cool, right. So that's basically the main driving point of the paper: oh look, we have this higher number of linear transitions per trajectory, so this model is more expressive than standard maxout. But, as I said, they don't compare it with dropout, and they don't compare it with other types of sparsity either — by that I mean sparse k-winners, like how we implement it.

So, in conclusion: I presented a couple of different ways of using, quote-unquote, dendrites in machine learning models. Some ideas are to separate the feedback/error pathways — that's one thing you can leverage dendrites for — or, you know, to increase capacity and expressiveness; and there have been some papers — by, you know, Poirazi and Mel, and from Poirazi's lab — where having a number of nonlinear dendrite segments increases the memory capacity of each unit. So, yeah, with this —
B: Like, structurally? Yeah, yeah — so one of the models I'm working on has similar concepts: it's feed-forward, has a number of dendritic segments per unit, and there are also sparse weights coming into it. The motivation for this was increased capacity: I wanted different branches leading into the neuron, so the neuron is able to respond differently to completely different inputs. So this would be, for example, for continual learning, where you'd have different tasks.

So let's say, here, every branch would correspond roughly to a task, right. And then one thing I wanted to do on top of that is use a type of feedback. This would be, you know, feedback from the output here — but in reality it's just another feed-forward layer, with the input size being, like, the number of classes — or, you know, you can design it however you like. Where's the annotation thing — there we go.

You know, et cetera, you can imagine. The idea being that there'd be some kind of gating or some type of coincidence detection, such that when some feed-forward component or feature co-occurs with the relevant categorical input — let's say, if this isn't clear to people: let's say you're training on task one, and this is the task-one categorical neuron, so task one projects to here and here, right —

— so when this input comes into these dendritic segments and it co-occurs with this categorical input, only then do you drive weight updates, or learning. And the idea is that, because of the sparsity, for each one of these tasks you would get minimal overlap between the dendritic segments they project to, so that you would hopefully have a different population of dendrite segments per task.
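A rough sketch of the gating idea just described — my own toy rendering of it, not a finished model: each unit has several dendritic segments, a categorical task/context vector drives the segments, and the best-matching segment gates the unit's response (and, by extension, which weights would learn). All names, sizes, and the thresholding rule here are made-up.

```python
# Toy context-gated dendritic segments for a continual-learning setup.
import numpy as np

rng = np.random.default_rng(5)
n_in, n_units, n_segments, n_tasks = 8, 4, 3, 3
W = rng.normal(scale=0.1, size=(n_units, n_in))                 # feed-forward weights
D = rng.normal(scale=0.1, size=(n_units, n_segments, n_tasks))  # segment weights

def forward(x, context):
    ff = W @ x                                  # feed-forward drive
    seg = np.einsum('ust,t->us', D, context)    # each segment's context match
    gate = seg.max(axis=1)                      # winning segment per unit
    return ff * (gate > 0)                      # only gated units respond

x = rng.normal(size=n_in)
context = np.eye(n_tasks)[1]                    # one-hot "task 2" signal
print(forward(x, context))
# Learning (not shown) would update W only where the gate is open, so
# different tasks recruit mostly non-overlapping segments and units.
```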
D: There's one big difference — you kind of mentioned this — which is that the way the sparsity is done is very different: here, each dendrite gets inputs that are mutually exclusive with every other dendrite on that same neuron, whereas in our case we want a distributed encoding, so we don't have that restriction, exactly.
D: Definitely in the input — in the categorical input, if it's a one-hot encoding, then it wouldn't matter; but if it's a distributed encoding — yeah — then we'd want a distributed sparse encoding in both cases. But here, even for the input, the inputs are mutually exclusive per branch, which is kind of weird; they're probably only doing it because of computational efficiency — there's no algorithmic reason to do it that way, I think. Yeah.
E: I think there is a biological reasoning in the paper: that, you know, an axon — each neuron is only going to connect to another neuron once, so it couldn't go to two branches. I think — yeah, I think what they say is that one axon from one neuron is not going to connect to two branches of the same neuron — the same output neuron — and they —
B: Because it has this, you know, cool property — and in their introduction they mention how a lot of the recent successes, a lot of the improvement in deep learning over the past few years, has been due to learnable activation functions such as maxout, so they wanted to use one of those. Yeah.
D: Yeah, maybe we're splitting hairs. It's definitely learning the weights, which has an impact, and in maxout the activation function itself is fixed, right — like when we have a sigmoid with a bias, no one says we're learning an activation function, even though, for the sigmoid, the bias is actually moving the zero point back and forth, right.
A: So I think, a point here: part of this is just language and how we're choosing to talk about it. But the point is, these neurons here — these little maxout neurons — can learn complicated shapes in the input space, whereas a normal neuron is always just a hyperplane; it's always just a direction and a threshold. But here the threshold is kind of curvy, because they have these multiple branches, and that's — so this —
E: To go back to the earlier question — I also believe — no, I'm not sure, I read this paper like two weeks ago — but I also think they have a biological motivation for the maxout: you only need one branch to fire for the neuron to fire; you don't need more than one. So I think the maxout there is also biologically motivated.
B: Yeah — so I didn't have time to read through a lot of this paper, but they basically use a morphological — a one-hundred-percent morphological — learning rule, which is similar in spirit to HTM: you know, given a certain condition, you increment or decrement — or, in this case, they create or remove — binary synapses.