From YouTube: Sparsity in Neural Networks (Brains@Bay Meetup)
Description
Some presentation slides posted on the meetup page: https://www.meetup.com/BraIns-Bay/events/263945823/
0:00 Subutai Ahmad - Sparsity in the Neocortex
24:05 Lucas Souza - Literature Review
43:46 Hattie Zhou - Deconstructing Lottery Tickets
1:19:00 Gordon Wilson - Sparsity in Hardware
Okay, we'll just get started. Lucas asked me to talk about sparsity. This meetup is all about the intersection between neuroscience and machine learning, and how we can learn from neuroscience to impact machine learning, and so today's topic is sparsity. What I thought I'd do is give a review of sparsity in the neocortex, and I'll focus on the experimental data; the rest of the talks, I think, are more machine learning oriented.
So that's kind of what I was showing there: if you look at a collection of neurons, how many are active right now? Generally it's a small percentage. And then there's this other one called lifetime sparsity, which is: if you look at a single neuron over its lifetime, across all the stimuli it's getting, how often is it actually active?
How do you learn? Well, you learn by changing the network structure. The cortex learns by adding and dropping connections all the time, so the connectivity is very sparse, but the connectivity is dynamic; it's constantly changing. This is a result that shows how automatic this is. Here it's looking at one of these dendritic segments over many days, and the red triangles show synapses, or connections, that were either added or dropped over that time.
This movie shows this kind of thing happening. This is a neuron and its dendrites, and what you'll see is that eventually these axons are actually growing and making connections; there's an axon coming in, so it's an output from another neuron that's coming in and forming a connection. This is what our brains are doing every day: these neurons are moving around, the outputs are changing, axons are forming and dropping connections, things like that. So it's pretty remarkable.
So one thing I was curious about when I was doing these slides is: in how many ways is the neocortex sparse? I started off with three different types of sparsity. Population sparsity means a small percentage of the neurons are active right now; lifetime sparsity means specific cells don't fire that often; and then there's the dynamic sparsity of the connectivity itself.
Okay, so I'll leave you with this, since we're going to switch to machine learning. I've tried to convince you that the neocortex has extremely sparse connectivity, extremely sparse activations, sparse learning, sparse weight values, and very sparse energy usage. I would say the neocortex is an existence proof that an extremely sparse dynamic system can operate, and operate more intelligently than any dense machine learning system in existence today.
Sparsity works: it's possible for such a really sparse system to do very well; it's an existence proof. The question, though, is: is this really required? This is an interesting topic. I think this type of dynamic sparsity, like in the neocortex, is required for building intelligent systems. If you really want to have anything at reasonable scale, you're going to have to have very efficient energy usage and be able to continuously learn.
So next we'll do a brief literature review. I'm not an expert: I was not an expert in pruning and also not an expert in sparsity, but I do research in machine learning. I went through the papers, and this review contains a lot of my opinions, so it's really great if you stop me as we go.
There's a lot of work happening in this space, and there's a newer wave of work on pruning, which most of the new papers reference: it's by Han et al., 2015. He was able to reduce storage and computation by an order of magnitude without affecting accuracy, just by learning which connections are important, and the criterion is mainly the weight magnitude. The approach Han follows is a three-step approach: first he trains the network, then he prunes it, and then he retrains the remaining network, i.e., fine-tunes it.
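As a concrete illustration, here is a minimal sketch of that three-step recipe in Python; the `train` function is a stand-in for a real training loop, and the layer names and sizes are invented for the example.

```python
# A minimal sketch of the train -> prune -> fine-tune recipe (in the spirit
# of Han et al., 2015). The model here is just a dict of NumPy matrices.
import numpy as np

def train(weights, masks, steps=1000):
    """Placeholder for a real training loop; here we only re-apply the
    masks so pruned connections stay clamped to zero."""
    for name in weights:
        weights[name] *= masks[name]
    return weights

def magnitude_prune(weights, masks, fraction=0.2):
    """Zero out the `fraction` of currently surviving weights with the
    smallest absolute values (the magnitude criterion)."""
    for name, w in weights.items():
        alive = w[masks[name] == 1]
        threshold = np.quantile(np.abs(alive), fraction)
        masks[name] = np.where(np.abs(w) < threshold, 0.0, masks[name])
        w *= masks[name]
    return weights, masks

rng = np.random.default_rng(0)
weights = {"fc1": rng.normal(size=(784, 300)), "fc2": rng.normal(size=(300, 10))}
masks = {k: np.ones_like(v) for k, v in weights.items()}

weights = train(weights, masks)                   # 1. train the dense network
weights, masks = magnitude_prune(weights, masks)  # 2. prune small-magnitude weights
weights = train(weights, masks)                   # 3. fine-tune the survivors
```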
And in Frankle and Carbin's work, he resets the weights to their initial values, while in a different comparison he just randomly reinitializes the weights, and only with the reset can you get the same kind of performance. So both of these works are showing that pruning is not as brittle as we used to think.
The last question asked is to what extent neural networks can do without learning parameters at all: are there solutions for a given task? The way he approaches this problem is that he has one single shared weight: all the weights have the exact same value, and he is not training the network. He's evaluating this network against a set of reinforcement learning tasks, and he's using the reward as a signal for a genetic algorithm that learns how to evolve the network.
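To make the shared-weight idea concrete, here is a toy sketch in the spirit of the weight-agnostic networks work; the topology, rollout, and candidate weight values below are all invented for illustration.

```python
# Toy illustration of the shared-weight evaluation: every connection in a
# candidate topology carries one common weight value, and the topology is
# scored across several such values; that score is the fitness signal a
# genetic algorithm would use to evolve the topology.
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden, n_out = 4, 8, 2
# A candidate topology: a random sparse connectivity pattern (0/1 entries).
topology = (rng.random((n_in + n_hidden, n_hidden + n_out)) < 0.2).astype(float)

def forward(obs, shared_w):
    """One feedforward pass where every edge uses the same weight value."""
    h = np.tanh(obs @ (shared_w * topology[:n_in, :n_hidden]))
    return h @ (shared_w * topology[n_in:, n_hidden:])

def episode_reward(shared_w):
    """Placeholder rollout: in the real setup this runs an RL episode."""
    obs = rng.normal(size=n_in)
    return -np.sum(forward(obs, shared_w) ** 2)  # dummy reward

# Fitness of the topology = performance averaged over several shared weights.
fitness = np.mean([episode_reward(w) for w in (-2.0, -1.0, 0.5, 1.0, 2.0)])
```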
Hey guys, I'm Hattie, glad to be here. Thanks, Lucas. So today I will be talking about the paper that was mentioned earlier, called Deconstructing Lottery Tickets; it's joint work with other folks at Uber AI Labs, including Jason, who's sitting there. So this work is building on the idea of neural network pruning and the lottery ticket hypothesis. To summarize: neural network pruning is a popular way of reducing the size of neural networks, and it typically follows a standard procedure.
As Lucas mentioned, at high levels of pruning you can actually get networks that are ten times smaller with no drop in accuracy. So if pruning works so well, why don't we just try to train a pruned network from the start? The answer, of course, is that it doesn't work: if you randomly reinitialize the weights and you train a pruned network, it does not reach the same accuracy.
So recently, this paper by Frankle and Carbin called The Lottery Ticket Hypothesis showed that you can actually train a pruned network from scratch, but only if you maintain the same original initialization for the different weights. So they proposed a variant of the pruning algorithm, which I'll call the lottery ticket algorithm.
So the steps are: (1) you randomly initialize the network; (2) you train it to convergence; (3) you prune the weights that have the smallest final magnitudes. Up to this point it's the same as what we talked about before, but then they do this special step, (4), which rewinds the remaining weights back to their original initialization values, and then (5) they train the network from this point.
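A compact sketch of that procedure, with a placeholder `train` function standing in for real training to convergence:

```python
# Minimal sketch of the lottery ticket algorithm: initialize, train, prune
# by smallest final magnitude, rewind survivors to their initial values,
# and retrain. All sizes are invented for the example.
import numpy as np

rng = np.random.default_rng(0)
w_init = rng.normal(scale=0.1, size=(300, 100))   # 1. random initialization

def train(w, mask):
    """Placeholder for training to convergence; the noise stands in for
    the weight updates a real training loop would apply."""
    return (w + rng.normal(scale=0.05, size=w.shape)) * mask

mask = np.ones_like(w_init)
w_final = train(w_init, mask)                     # 2. train to convergence

prune_frac = 0.8                                  # 3. prune smallest |final|
threshold = np.quantile(np.abs(w_final), prune_frac)
mask = (np.abs(w_final) >= threshold).astype(float)

w_rewound = w_init * mask                         # 4. rewind survivors to init
w_ticket = train(w_rewound, mask)                 # 5. retrain from the rewind
```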
So we do this in an iterative fashion, repeating steps 2 to 4 and removing an additional percentage of the weights of the network each time. Using this procedure, they show some interesting results. Here the y axis shows the test accuracy of the networks and the x axis shows the level of pruning, which is basically the percentage of weights remaining, and here we'll look at a convolutional network trained on CIFAR-10. The black line represents the original accuracy of that network without any pruning.
So, as we move from the left side to the right side of the x axis, we prune the network more and more, and we see that the pruned networks actually perform even better than the original network, and if we continue to prune, the performance drops back down to match the original full network. But at this point we've aggressively pruned the network, so we only have about five percent of the weights.
So, based on these results, they proposed the lottery ticket hypothesis: that randomly initialized dense neural networks contain subnetworks that are initialized such that, when trained in isolation, they can match the performance of the original network. They call these subnetworks winning tickets, and suggested that it's a combination of their initialization and structure that makes their training particularly effective.
One caveat: the original experiments only looked at CIFAR and MNIST, and there is newer work showing you don't really see this on ImageNet. What you actually need on ImageNet is, instead of rewinding to the original initialization, to rewind back to some initialization value at, say, epoch one.
It's because normally in your classification layer you're going from however many neurons in the layer before that to maybe just ten, right, if it's CIFAR, since you're just going to 10 labels. So if you're using 95% sparsity, you only have, I don't know the math offhand, but a very small number of parameters. So you're left with too few parameters in that final classification layer.
Here's something we observed while training these networks, and I promise it will become relevant to that question later. So imagine you initialize a network randomly and you apply it on the MNIST dataset without training the network. How well do you think it would do? Well, if you don't train the network, you would expect no better than chance performance, which is 10 percent.
That's why in our title we call it a supermask. Well, it turns out that in answering our first question, we can also provide an explanation for supermasks. The pruning procedure performs two actions: it sets the pruned weights to zero, and it keeps them there during retraining. We can decouple the effects of these two actions by running a simple experiment.
So, instead of setting the pruned weights to zero, we can freeze them at their randomly initialized values. If the values of the pruned weights don't matter, then this should perform similarly well. However, we see that that's not the case: if we freeze the pruned weights at their initial values, the performance is significantly worse.
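A minimal sketch of that decoupling experiment, with invented shapes and a stand-in gradient: variant A zeroes the pruned weights, variant B freezes them at their initial values.

```python
# Decoupling the two actions of pruning: zeroing a weight vs. freezing it.
# Per the talk, variant A (freeze at zero) trains well, variant B (freeze
# at the random initial value) is significantly worse.
import numpy as np

rng = np.random.default_rng(0)
w_init = rng.normal(scale=0.1, size=(300, 100))
mask = (rng.random(w_init.shape) < 0.2).astype(float)  # 1 = kept, 0 = pruned

def masked_step(w, grad, frozen_values, mask, lr=0.1):
    """Update only the kept weights; hold pruned ones at `frozen_values`."""
    w = w - lr * grad
    return mask * w + (1 - mask) * frozen_values

grad = rng.normal(size=w_init.shape)  # stand-in for a real gradient
w_zeroed = masked_step(w_init, grad, np.zeros_like(w_init), mask)  # variant A
w_frozen = masked_step(w_init, grad, w_init, mask)                 # variant B
```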
So this seems to suggest that the values of the pruned weights do contribute to the overall performance of the network, and it seems that zero is a particularly good value for them. To see why that might be, we need to take a closer look at the mask criterion we're using. We can think of different mask criteria as regions on a 2D plane, with the x axis being the value of the initial weight and the y axis being the value of the final weight. This plot represents a distribution of weights from a given layer.
That's partly because the initial values are correlated with the final values. So the mask criterion used by the lottery ticket algorithm keeps the weights with the largest final magnitudes, regardless of what their initial values are; we refer to this as large final. It sets the weights with the smallest final magnitudes to zero, so what it's actually doing is setting to zero the weights that end up closest to zero at the end of the training process.
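In code, the large final criterion is just a threshold on the final magnitudes; a minimal sketch (the function name and keep fraction are our own):

```python
# The "large final" criterion: keep the weights whose final magnitude is
# largest, ignoring initial values entirely.
import numpy as np

def large_final_mask(w_init, w_final, keep_frac=0.2):
    """Return a 0/1 mask keeping the top `keep_frac` of |w_final|."""
    threshold = np.quantile(np.abs(w_final), 1.0 - keep_frac)
    return (np.abs(w_final) >= threshold).astype(float)
```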
It's the final values that matter. To test this hypothesis, we can run a second experiment: for any weight to be pruned, we set it to zero only if it moved toward zero over the course of training, and we freeze it at its initial value otherwise. Using this treatment, we get networks that perform just as well as the lottery ticket networks, even though we did not set all of the pruned weights to zero.
Given our view on masking as training, an obvious thing we can try is, instead of keeping the weights with the largest final magnitudes, to keep the weights that increased in magnitude the most over training. Let's illustrate it here and call it magnitude increase. This mask criterion basically explicitly sets to zero the weights that moved most toward zero during training.
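The magnitude increase criterion differs only in what gets thresholded; again a minimal sketch with our own naming:

```python
# The "magnitude increase" criterion: keep the weights whose absolute value
# grew the most during training, i.e. prune those that moved toward zero.
import numpy as np

def magnitude_increase_mask(w_init, w_final, keep_frac=0.2):
    """Return a 0/1 mask keeping the largest values of |final| - |init|."""
    growth = np.abs(w_final) - np.abs(w_init)
    threshold = np.quantile(growth, 1.0 - keep_frac)
    return (growth >= threshold).astype(float)
```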
Does it work? Well, luckily, yes, as we might expect. That's the green line compared to the original large final criterion, so this works, and sometimes works significantly better than large final. In the paper we have also tried a bunch of different mask criteria, not all of which we expected to work, but we wanted to check our understanding. I won't really go through all of them, but we see some criteria that can produce lottery tickets, in that they can match the performance of the original network, and a bunch that don't.
A weight's initial value has two components: the magnitude and the sign. Is it the combination of the two that we must keep? To see, let's go back to this ablation; remember, this is where we reinitialize the weights randomly. To see which component is important, we can try a variant of this where we reinitialize the weights, but then force them to have the same sign as their original initialization. That's shown by the solid yellow line here, and the black line is the baseline, which is random reinitialization.
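A two-line sketch of that sign-preserving variant, assuming NumPy arrays of weights:

```python
# Reinitialize randomly, then force each weight to carry the sign of its
# original initialization. Shapes and scales are invented for the example.
import numpy as np

rng = np.random.default_rng(0)
w_init = rng.normal(scale=0.1, size=(300, 100))

w_reinit = rng.normal(scale=0.1, size=w_init.shape)  # random reinit (baseline)
w_signed = np.abs(w_reinit) * np.sign(w_init)        # same signs as original
```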
So the other thing you might think is that, since the initial values correlate with the final values, and we're keeping the weights with large final values, the initial values of the kept weights may not have the same distribution as the overall network, and that's illustrated here: the blue represents the kept weights' initial values. So perhaps if we follow this distribution to reinitialize the weights, that could work better, and the way we do that is basically by shuffling the kept values within each layer.
So that also does not work, and that's shown by the dashed line here; sometimes it gets pretty unstable as well. However, if we maintain the sign, we see that the network works much better, and it's pretty close to the original. So this seems like it's the sign: yes, there is a pattern here, and the sign seems to be the key. Thanks.
Interestingly, if you convert all the weights to constant values, similar to the mask-one experiments, we can actually get the network to work even better, up to 86 percent on MNIST. So in this network, basically all the values are either zero or plus or minus a single constant. We also wanted to see if we can push the performance of this by learning the mask directly, so we're training the mask itself.
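For illustration, here is a minimal sketch of what applying a supermask to a completely untrained network looks like; the mask here is random, whereas the paper's masks are chosen by criterion or learned, and the architecture is invented.

```python
# A "supermask" is a binary mask laid over an untrained, randomly
# initialized network; per the talk, a good mask alone can push MNIST
# accuracy far above the 10% chance level.
import numpy as np

rng = np.random.default_rng(0)
w1, w2 = rng.normal(size=(784, 300)), rng.normal(size=(300, 10))
m1 = (rng.random(w1.shape) < 0.5).astype(float)   # a candidate supermask
m2 = (rng.random(w2.shape) < 0.5).astype(float)

def predict(x):
    """Forward pass of the *untrained* network with the mask applied."""
    h = np.maximum(x @ (w1 * m1), 0.0)            # ReLU hidden layer
    return np.argmax(h @ (w2 * m2), axis=-1)
```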
We'll start at a pretty high level, though in the interest of time I will not spend too much time on all the very high-level motivation stuff. I typically give this presentation to a broad range of audiences, and you all probably want something more technical, so hopefully the beginning will just be a little bit of a refresher.
It's a fun presentation I like to give, but we'll definitely dive into the meat of it too, and since everybody's ready, let's just get started. All right: I'm the CEO and founder of Rain Neuromorphics. We are just down the street, a five-minute walk, and we have a few of our folks here, including Jack, our CTO, and our first employee, who all walked over to join for this meeting today. So we build processors for artificial intelligence that are inspired by the brain.
One of the books that Jack read, more than five years ago now, was On Intelligence, which was one of the core inspirations to invent this technology. We'll get right to it. So our mission is to build the first hardware that can power brain-scale intelligence. We'll talk a lot about scaling neural compute: basically, how the existing paradigms scale, what the limitations are, and how we are seeking to improve upon that.
We started in Gainesville, Florida, and we moved back here about a year ago. We have on the team the guy who co-designed GPS chips, which is cool. Some of our investors: our biggest investor is the CEO of OpenAI, Sam Altman, who came in last summer, as well as some of the folks behind the block-sparse paper, and our key partners we work with include TSMC.
There are six parts, but we'll certainly get through the first five: giving a history and the motivations behind why you need a fundamental paradigm shift in scaling neural compute, and then talking about our hardware and how sparsity is really the core concept that underlies the motivation for our hardware and how it works. The first part we'll go through quickly, but hardware and AI have an intertwined history: the first perceptron implemented in hardware was called the Mark I.
Spoiler alert: over the course of the last sixty years, we've seen these ebbs and flows of AI summers and winters, but one thing that we see across the board is that they were really fundamentally defined by the computing power that people had available at the time. People always had really powerful imaginations and were creative in thinking about the algorithms we could create, but they were limited by the hardware they had to run them. So the first summer was from 1956 to 1974.
It was marked by this kind of untethered, undirected funding from the governments in the US and UK, and that ended with our first winter around 1980. Notably, an NLP model from that first summer had a vocabulary of just 20 words, because that was the best the computers' memory could support at the time. And Hans Moravec, during that winter, had said that we need about 1 million times better computers; funnily enough, if you track Moore's law scaling, we're really quite close to 1 million times right now.
Really, Moore's law kept scaling up, and the benefit of Moore's law meant that people could build rules-based AI systems that really started to capture people's imaginations of what we could actually accomplish. This was when the DARPA Grand Challenge was won for the first time, with the Stanford robot going out; IBM Watson was on Jeopardy, those types of things. And of course we get to 2012 and the GPU revolution. That's where we are today, and that's what modern deep learning is all defined around.
So in part two we'll talk about AI silicon today: what are we using, and why is this the de facto hardware for training and inference? I don't need to tell you about Nvidia; it positions itself as the deep learning company, and if you buy something from Nvidia today, it looks like this: the top-of-the-line GPU for training networks in a datacenter costs about $10,000, and you're using it to run neural networks.
The GPU just hasn't really changed very much: it's being used to perform matrix multiplication. This is the core operation that underlies everything. Whether you're looking at graphics, in which case the matrix corresponds to pixels or polygons that you modify by some type of transformation, or a neural network, where the matrix corresponds to the weights and the vector is the activations moving from one layer to the next, it's the same operation.
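That core operation is tiny to write down; a minimal sketch of one layer's forward pass, with invented sizes:

```python
# A neural network layer is just a matrix-vector multiply: weights times
# incoming activations, followed by a nonlinearity.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 512))   # weights connecting layer k to layer k+1
a = rng.normal(size=512)          # activations leaving layer k
z = W @ a                         # the matrix multiply a GPU accelerates
a_next = np.maximum(z, 0.0)       # ReLU, then on to the next layer
```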
It wasn't the first time someone used a GPU to train a neural network, but it was the first time someone used multiple GPUs to divide the training, and used that parallelism to speed up the training of their networks and to have a larger model. This was AlexNet, and it broke the ImageNet benchmark by a wide margin, so this really kicked off the renaissance of deep learning that we're in today, and since then the amount of compute we're throwing at these models has just been increasing like crazy.
And while variations in architecture can incrementally improve model performance, it seems that if you just throw more compute at a model and make the model bigger, it performs better across the board. That's essentially the quote from Rich Sutton, the father of reinforcement learning.
This is a bit of review, particularly since all you folks are in the machine learning world, but bear with me. We're seeing every month these incredible advancements in the three areas that we're most excited about at Rain. Generative models: these definitely captured people's imagination in the last few months with generated faces, and there are all types of debates and things there that are interesting.
There are other interesting uses of this too, like companies that generate new drugs, new types of protein structures that are effective at fighting disease, which is incredibly promising to me. And of course reinforcement learning, which captured the world's imagination with AlphaGo; I just spoke with someone from the Google Brain team, and I need to find the paper to show you, but they were describing improvements of ninety percent from last year.
So these reinforcement learning models are getting good. And of course natural language processing: the transformer models, BERT, GPT-2. It's really amazing to see what these models are capable of doing right now, both in terms of generative measures as well as question answering, but these are some of the biggest models that we have seen, and the issue is that these models can be enormous.
So this comes back to the question of how AI is trained today: whether it's GPUs or TPUs, at the core are these same operations. And of course we look to the brain for inspiration, and the brain, we see, is fundamentally different; as we'll come to, you'll see that all of these points have a common theme.
It's worth emphasizing the value of small-world connectivity and why this is a very special type of sparsity. You have two opposing paradigms: fully connected versus locally connected. If you fully connect a system, you'll have a very short path length, a path length of one from any point to any other point, but you're going to spend a lot on all the wires, all the connections. If you're just locally connected, you will save on your wiring cost.
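A quick way to see the trade-off numerically is the classic Watts-Strogatz construction (this uses the networkx library; the sizes are arbitrary):

```python
# Start from a locally connected ring and randomly rewire a small fraction
# of edges into long-range shortcuts: same wiring cost, far shorter paths.
import networkx as nx

n, k = 1000, 10                    # 1000 nodes, each wired to 10 neighbors
local = nx.watts_strogatz_graph(n, k, p=0.0)         # purely local wiring
small_world = nx.watts_strogatz_graph(n, k, p=0.1)   # a few random shortcuts

# Same number of edges in both graphs, drastically shorter average paths.
print(nx.average_shortest_path_length(local))        # roughly n/(2k) = 50
print(nx.average_shortest_path_length(small_world))  # only a few hops
```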
This was a paper that we published last summer. So this means that we can build a new type of AI processor: we're using analog computation, using voltages and resistances to perform this matrix multiplication, but instead of the crossbar, which had the neurons only on the edges, we fill the entire chip with these neurons, packed edge to edge, close together, so we can have a huge density of neurons, and on top of that we overlay a random mesh of metal wires connecting them as a small-world network.
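To see why analog helps, note what the circuit computes: with weights stored as conductances and inputs applied as voltages, Ohm's law and Kirchhoff's current law perform the multiply-accumulate physically. A numerical stand-in with invented values:

```python
# What an analog array computes in one shot: the current collected on each
# output line is the conductance-weighted sum of the input voltages,
# i.e. I = G^T V, which is exactly a matrix-vector multiply.
import numpy as np

rng = np.random.default_rng(0)
G = rng.uniform(0.0, 1e-3, size=(128, 64))  # conductances (siemens) = weights
V = rng.uniform(0.0, 0.5, size=128)         # input voltages = activations
I = G.T @ V                                  # column currents = weighted sums
```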
There were definitely some eyebrows raised, but what's so exciting is that you can actually go back, and this was the intuition: this notion of randomness is something that a lot of people have been comfortable with for a long time. We randomly initialize weights, we randomly select neurons to drop out, and they randomly connected the neurons in the Mark I Perceptron.
This is a sparse matrix multiplication with about 99% sparsity. And again, that would be at 65 nanometer, while the GP100 was at 12 nanometer, but we're operating at two orders of magnitude better in speed and power. So we start at two orders of magnitude, and we have a roadmap to go beyond that.
To emphasize the scaling comparison again: when you're working with digital logic, you have order n-squared scaling in time; when you're working with analog physics, you have order n-squared scaling in space. But because we are removing what we believe are the fundamentally redundant connections in these networks, we can achieve order n scaling in both space and time, and we're the only chip architecture that we know of that can do that.
There's one operation we want to do really well, and because we'll just be doing that one operation, we're actually pretty agnostic to the compiler ecosystem. That means we don't have to build these massive CUDA-like software layers for this first product, and we just want to get it out into people's hands so we can start exploring: what can we do when this is so fast and so efficient? And we're starting out with a 100x improvement on both speed and energy.
It supports a few ranges of models, and ideally even more, and this is obviously looking further into the future beyond just our first product. But what we'll initially be able to demonstrate in the next six months is called reservoir computing. Basically, you have a giant space of neurons that are randomly connected; your input gets projected into there, and that puts the input into a higher-dimensional space, so that it's more easily separable by a linear classifier, but basically you don't have to train any of the recurrent weights.
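Here is a minimal echo-state-style sketch of that idea; the reservoir is fixed and random, only a linear readout is fit, and all sizes and data below are made up:

```python
# Reservoir computing in miniature: a big random recurrent pool is never
# trained; we only fit a linear readout on top of the reservoir states.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, T = 3, 200, 500
W_in = rng.normal(scale=0.5, size=(n_res, n_in))   # fixed random input map
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))    # scale for stable dynamics

u = rng.normal(size=(T, n_in))                     # input time series
y = rng.normal(size=T)                             # made-up targets

states = np.zeros((T, n_res))
x = np.zeros(n_res)
for t in range(T):                                 # run the fixed reservoir
    x = np.tanh(W_in @ u[t] + W @ x)
    states[t] = x

# Train only the linear readout (ordinary least squares).
w_out, *_ = np.linalg.lstsq(states, y, rcond=None)
```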
You don't have to have fine control over the memristors, but it's still a really powerful way to do time-series classification. Do we want to support backpropagation? We do support backpropagation: we've tailored a type of backpropagation that works with just an ARM chip, but the goal here is that we want to really just make it easy.
But it's also about the algorithms, and there's this: we just filed a provisional patent last week on a reduction to practice of an algorithm, and we're very excited about this. It's energy-based models. What energy-based models do is: you define an energy function and you tie the minimization of that energy to the minimization of the loss in your network.
And because we have a physical network of resistors, we have a physical energy: it's the dissipation of energy on this chip. So by measuring the dissipation of energy on this chip, we can actually understand how to train it, and we get the gradient for free, so to speak, by observing the dissipation of energy. It's an incredibly powerful thing.
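As a rough illustration of "tie energy minimization to loss minimization", here is a toy sketch in the spirit of equilibrium propagation (Scellier & Bengio, 2017); this is one algorithm in that family, not necessarily Rain's method, and all sizes and constants are invented:

```python
# Toy two-phase training on an energy function. Free phase: the state
# settles to a minimum of E(s) = 0.5||s||^2 - 0.5 s.W.s - x.s. Nudged
# phase: the energy is weakly biased toward the target. The contrast
# between the two settled states estimates the loss gradient.
import numpy as np

rng = np.random.default_rng(0)
n, beta, eta = 8, 0.5, 0.1
W = rng.normal(scale=0.1, size=(n, n))
W = (W + W.T) / 2                      # symmetric couplings
np.fill_diagonal(W, 0.0)

def settle(s, x, y=None, steps=200, lr=0.05):
    """Relax the state s down the energy, optionally nudged toward y
    by an extra term beta * 0.5 * ||s - y||^2."""
    for _ in range(steps):
        grad = s - W @ s - x               # dE/ds of the free energy
        if y is not None:
            grad = grad + beta * (s - y)   # nudge toward the target
        s = s - lr * grad
    return s

x = rng.normal(size=n)                 # input drive
y = np.tanh(rng.normal(size=n))        # made-up target

s_free = settle(np.zeros(n), x)        # free phase: settle to a minimum
s_nudged = settle(s_free, x, y)        # nudged phase: weakly clamp toward y
# Contrastive, purely local weight update from the two settled states.
W += (eta / beta) * (np.outer(s_nudged, s_nudged) - np.outer(s_free, s_free))
np.fill_diagonal(W, 0.0)
```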
The immediate consequence of this is that we could avoid the analog-to-digital conversion from layer to layer, which current analog approaches to neural networks need to do and which is incredibly energy hungry, so we can have very, very low energy consumption.
Like, imagine a billion parameters on a wearable. But probably the most exciting thing here is that, by using energy-based models supported by a physical system and using physics, we now have the tools of physics to rigorously investigate these networks.
We don't have rigorous analytical toolsets to really break down our deep neural networks today, but imagine if we had something like Maxwell's equations to actually understand them.
We want to become the next platform on which all artificial neural networks are built; we're certainly not wanting for ambition. And yeah, we envision a world where massive models are everywhere. Well, that's a world we already live in, because those are our brains, right? We have billions and billions of neurons and synapses, and we do it, so why can't we? So that's all I have. I do have these market slides, and I feel a little itchy announcing all this, but...
Each random mesh ends up being a little bit different, but provided you have a high enough density and an even enough distribution of the wires across the chip, what matters is the effective resistances between any two electrodes, and you can just nudge the weights: as we do, we pulse voltages to raise the resistances of the memristors.
You can get away, to a large degree, with fixed sparsity if you exploit the topology of the data. So if you look at convolutions: convolutions are close, but what they do is reflect the topology of the input; there's a 2D structure to your images, and convolutions are just breaking that up in this sparse way to naturally operate on things that are most likely related. So I think that if you map, if you make your... we have this 2D structure, and if you create, if you like...