From YouTube: Recap on Hot Chips Conference - August 24, 2020
Description
Kevin Hunter does a recap of the Hot Chips 32 conference, which was held virtually on August 16-18. He highlights presentations and talks that focus on the latest processor innovations and machine learning processing.
A
The premier local conference on what's happening in high-performance microprocessors and integrated circuits. It started in 1989 and is held once a year in August in Silicon Valley, with tutorials, talks, and posters.
They had, of course, to go virtual this year. There were 1,250 registrants, practically double last year's, so there's some hot stuff there. For the tutorials they had two sessions on Sunday; one was machine learning scale-out, which I'll talk a little bit about.
A
The first two topic areas actually corresponded with ones on Monday morning, and I didn't deem them important enough to miss research stand-up for, but the keynote was kind of interesting, and I had to look at the recorded version afterwards because I also had a conflict for that one. I'll talk a little bit about that. So Raja Koduri, senior vice president of Intel graphics, he's got a bunch of titles there, but the title of the talk was "No Transistor Left Behind."
A
Yes, I'm going to touch on the FPGAs and reconfigurable architecture. Actually, the FPGA ones were not so interesting, but the reconfigurable-architecture one was. The ML training and ML inference sessions, even though I went to them, I'm not actually going to talk directly about those today in the interest of time.
A
A considerable amount, in fact almost all of it in some sense, was related to it, in the sense that AI and machine learning are driving hardware to such an extent that everyone's kind of looking at that. So on the server side there was some talk about that, but also on the mobile, even on the mobile processing side.
A
It's like, what kind of AI features are being added into the things that are being put into your cell phones and such. But I was mostly going to concentrate on where people see the future to be and where sparsity cropped up. Very few of the major players are actually pushing sparsity; the one exception might be, excuse me, NVIDIA and their A100 architecture, but it's only fine-grained structured sparsity, two nonzero values out of every small block of four.
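Just to make the idea concrete, here is a minimal sketch of that fine-grained structured sparsity pattern. The A100 scheme is commonly described as 2:4, meaning at most two nonzero values in every group of four weights; this is an illustration with NumPy, not NVIDIA's actual kernel.

```python
# Minimal sketch (not NVIDIA's kernel): 2:4 fine-grained structured sparsity,
# i.e. at most 2 nonzero values in every contiguous group of 4 weights.
import numpy as np

def prune_2_of_4(weights: np.ndarray) -> np.ndarray:
    """Zero out the two smallest-magnitude values in each group of four."""
    w = weights.reshape(-1, 4).copy()
    # Indices of the two smallest |w| entries per group of four.
    drop = np.argsort(np.abs(w), axis=1)[:, :2]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

w = np.random.randn(2, 8)
print(prune_2_of_4(w))  # every group of 4 now contains exactly 2 zeros
```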
A
That kind of thing. Of course, there's also Cerebras, which I'll talk about a little bit later. They have sparsity to some degree, but they didn't get heavily into what level of sparsity they actually support.
A
There
are
also
three
posters
which
I
think
we're
they're
reading
when
I
get
a
pointer
to
where
I
can
put
it
up
on
the
google
drive
I'll
put
up
all
the
the
pdfs
for
all
this
stuff.
There
they're
also
recorded
talks,
but
it
requires
a
login
to
get
to
them
and
if
anyone
you
know,
is
interested
in
actually
hearing
the
talks
behind
these
things,
I'll
try
to
make
that
available
to
you
in
some
way.
A
I can just share my credentials and tell you how to navigate to it. But these three posters looked kind of interesting because they were related to sparsity, and in particular the gampu one looks like they were trying to deal with sparsity to a limited extent, but they were dealing with it serially rather than in parallel.
A
Okay, so the first tutorial area was machine learning scale-out. In their parlance, scale-up means more intense compute on the chip. So when you double the transistor count, or you try to pack more functionality on there, that's what they mean by scale-up; in trying to get to Moore's law equivalents, that's one way of doing it.
A
The
typical
problem
there
is
that,
is,
you
have
a
power
barrier
that
they
they
can
only
scale
up
by
basically
keeping
to
a
certain
size
chip
and
this
packing
more
onto
there.
You
used
to
be
able
to
up
the
clock
frequency
and
that's
just
too
prohibitive
power
wise
these
now
you'll
melt.
The
chip
scale
out
means
distributing
the
problem.
So
one
of
the
things
that's
becoming
more
and
more
prevalent
now
is
rather
than
just
having
a
single
chip.
A
You
have
a
what's
called
a
silicon
interposer
which
is
kind
of
like
a
piece
of
silicon
with
kind
of
mating
wires
on
it,
and
you
drop
little
chiplets
on
there
to
add
the
functionality
in
some
cases,
they're
all
symmetric
some
cases.
You
have
heterogeneous
chips,
so
you
can
mention
a
max
capable
mix
and
match
capability.
B
A
Yeah, no, I mean your Xilinx chips already have that right now. That technology has been around for, okay, all right, ten years. It's just a question of the degree to which they're relying upon it now, because the big problem is that as you make the chip larger, your chance of hitting a critical defect is greater. So the idea is that, okay, we box...
B
A
Yeah, that's not even a chiplet; they just went to the entire wafer, monolithically, across the whole thing. So rather than piecing together good die, they just route around the ones that have failed.
A
Multiple boards per rack slot, where they kind of slide in those drawers, and then multiple racks, and now we're talking warehouse-level compute, where the entire building theoretically could be dedicated to one problem, though in practice that's not the case. One of the other things that I've noted is that they're starting to put computation in the network switches.
A
So the communication paths between all these servers now, it's not just that they're smart; they're actually doing some forms of computation on them, either compression or some kind of processing on the way to somewhere else.
A
So most of the talks were actually case studies in scale-out. So unless someone's interested in the hardware details of how they approached it, I didn't think that would be worth covering in this talk. One of the talks that I thought was worth looking through, at least the slides of, is the fundamentals of scaling out deep learning training. They did a very good presentation of what types of operations you need when you start scaling out.
A
You
know
beyond
a
chip
or
a
board.
What
kinds
of
parallelism
do
you
can
you
do?
I
mean
the
first
thing
everyone
does.
Is
data
parallelism
where
you
have
70
units
that
do
you
know
a
vector
multiplies
in
one
parallel
shot,
but
at
some
point
you
run
out
of
capacity
on
whatever
chip
or
board
that
you
have,
and
so
then
you
start
subdividing
in
the
model,
so
you
can
either.
A
If
we
think
of
our
ideal
models,
where
you
know
they're
you,
what
you
would
do
is
take
a
piece
of
the
model
and
execute
it
all
the
way
through
on
one
ship
and
then
subdivide
that.
So,
if
you
have,
you
know,
you
know
a
thousand,
you
know
size
vector
as
a
as
a
kind
of
tried
example.
A
You
might
divide
it
up
into
pieces
of
a
hundred
and
then
execute
the
whole
model,
including
boundary
conditions
through
the
whole
thing,
and
then
we
combine
it
at
the
end,
the
other
one
is
just
pipelining,
taking
each
layer
and
operating
on
a
different
chip,
and
then
shipping
the
intermediate
results
between
shift
to
chip
and
then
all
combination
of
these
above
now
what's
interesting
about.
A
That
is
that
when
you
look
at
how
you
want
to
recombine
the
results,
there's
various
parlances
of
these
map
reduce
operations
where
you
would
sum
them
all
in
in
one
case
where,
if
you
split
up
a
matrix
multiply,
there's
he
basically
goes
into
what
types
of
these
mapreduce
operations
you
would
use
that
you
call
it
a
higher
level
both
for
the
forward
pass
and
sometimes
it's
different
for
the
backward
pass,
and
I
I
think
it
was
a
really
good
presentation
of
that.
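As a rough illustration of why the recombine operation depends on how you split things, here is a minimal NumPy sketch, my own example rather than one from the tutorial, of splitting a matrix multiply two different ways:

```python
# A minimal sketch of why the recombine step differs depending on how you
# split a matmul y = x @ W across devices.
import numpy as np

x = np.random.randn(4, 8)          # activations
W = np.random.randn(8, 6)          # weights

# Split W along its *input* dimension: each "device" computes a partial product
# and the recombine is a sum (an all-reduce in distributed parlance).
partials = [x[:, i:i+4] @ W[i:i+4, :] for i in (0, 4)]
y_reduce = sum(partials)

# Split W along its *output* dimension: each device owns whole output columns
# and the recombine is a concatenation (an all-gather).
pieces = [x @ W[:, i:i+3] for i in (0, 3)]
y_gather = np.concatenate(pieces, axis=1)

assert np.allclose(y_reduce, x @ W) and np.allclose(y_gather, x @ W)
```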
A
It
opened
my
eyes
to
what
was
possible,
so
here's
cerebrus
they
actually
had
a
talk
in
outside
of
the
tutorials,
but
it
was
basically
a
more
or
less
reiteration.
A
So
numenta
has
had
a
relationship,
obviously
with
cerebrus.
So
this
is
a
way
for
scale
engine.
That's
you
know
the
center
section
out
of
a
300
millimeter
chip.
Excuse
me
a
wafer,
so
I
mean
those
are
the
stats.
I
mean
there's
incredible:
400
000
processing
elements
they
claimed
that
they
experienced
no
delays
talking
from
one
corner
of
the
chip
of
the
wafer
to
another.
A
So
whatever
they've
done
to
design
the
thing,
they
don't
consider
communication
to
be
a
barrier
that
it
takes
you
longer
to
access
corner
to
corner
than
adjacent,
so
they've
designed
it
to
kind
of
make
to
to
balance
all
that,
but
1.2
trillion
transistors.
A
You
know
the
amount
of
memory
I
mean
these
are
the
stats
here.
The
interesting
thing
is,
as
I
mentioned
earlier,
they've
supported
some
form
of
sparsity.
They
just
didn't
get
into
it
a
lot
in
in
this
talk,
but
they
also
provide
examples
similar
to
what
I
was
showing
to
you
before
about
a
data
parallel
model
parallel,
the
two
forms
of
it.
They
actually
show
how
they
can
map
those
onto
with
their
compiler
onto
pieces
of
the
of
the
or
processors
of
the
chip.
B
Kevin, yeah, typically this kind of thing has a problem with heat dissipation. Are they relying on sparsity to reduce heat, or is that for some reason not a problem?
A
B
A
Yeah, they didn't; if they did, I didn't twig on it. But the thing is that there is a notion, when you talk about this kind of level of integration, what's called dark silicon, where they can't afford to turn on everything on any one of these chips, and they turn off pieces when they don't need them. It's possible that sparsity gives them a leg up on that, but also, if they rotate around something...
A
Maybe
they'll
use
pieces
of
each
of
processors
they
go
through,
but
it
is
definitely
a
huge
problem.
If,
if
this
entire
thing
was
dissipating
power,
it
would
melt
so
yeah
good
point.
So
take
a
take
a
look
at
this.
Can
you
see
my
cursor
up
here
by
chance?
A
Okay,
so
take
a
look
at
these
stats
and
I'm
going
to
show
you
the
only
they
would
say
about.
They
have
another
generation
coming
down
the
pike?
A
Okay,
basically,
it's
2x,
so
they're
going
to
be
going
on
the
most
advanced
tsmc
process
and
they
were
kind
of
cagey
as
to
when
this
might
actually
be
when
they're
gonna
actually
do
the
reveal
of
this.
But
what.
B
A
No, the first version, the first generation, is available on servers that can be accessed publicly, I believe.
B
Well, I guess the question is: is that being used, has it been proven? It's like, has Google ordered a hundred thousand of them, or is it always more like, hey, you can try this out, type of stuff?
A
I don't know their deployment level. I can try going through the slides and get an answer to your question. Yeah.
A
B
The question is, okay, this is such a bold move to do. I mean, wafer scale, the idea has been around for a long time, but these people make a lot of noise about it. The question is, has it proven itself commercially, such that people are... it's like, still, like, we're...
A
Maybe it's a Pollyannaish look at this thing, and I mean they could just be sucking down investment dollars, but the fact of going to a generation two indicates...
A
B
Well, it sort of reminds me of the stuff that the Human Brain Project was doing, the Karlheinz Meier project in Germany. They had these things, it was available, researchers were working with them, and you could log on and use them, but it wasn't commercially useful yet, and so there's a big difference between those. I don't think we have to spend more time on it.
A
Okay, so the other talk I found interesting was a talk from Google where they're talking about GShard. The concept of sharding is any time you take a problem, in machine learning or whatever, in some high-performance computation, and basically break it up into pieces, and then scatter, distribute those pieces across physical boundaries, say to other chips, to other boards, to other servers, and then recombine and pull the results back.
A
Basically
this
this
talk
was
describing
the
language
they
had
for
dealing
with
that
and
how
at
google
it
it.
They
have
a
flow
from
defining
a
model
in
intense
overflow
and
working
it
down
through
an
optimization
layer
down
through
their
optimizing,
compiler
and
then
deploying
outwards
requires
partial
annotation
with
pragmas
in
the
code,
to
kind
of
give
an
indication
of
how
you
want
it
to
break
apart.
But
there's
a
lot
of
that.
That's
that's
automated
and
that's
what
they're
kind
of
highlighting
I
mentioned.
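Here is a hypothetical sketch of that annotation idea; it is not Google's actual GShard API. The programmer tags a few tensors with how they should be partitioned, and the compiler is expected to propagate and automate the rest.

```python
# Hypothetical sketch (not Google's GShard API): the programmer only annotates
# a few tensors with a sharding choice; a compiler pass is then supposed to
# propagate the split/replicate decisions through the whole graph.
from dataclasses import dataclass
import numpy as np

@dataclass
class Sharded:
    """A tensor tagged with how it should be partitioned across devices."""
    data: np.ndarray
    axis: int          # which dimension to split
    num_shards: int    # how many devices to spread it over

    def shards(self):
        return np.array_split(self.data, self.num_shards, axis=self.axis)

# Annotate the input batch as split over 4 devices along the batch dimension;
# weights stay replicated. Each "device" runs the same layer on its shard.
batch = Sharded(np.random.randn(16, 32), axis=0, num_shards=4)
W = np.random.randn(32, 8)
outputs = [shard @ W for shard in batch.shards()]
result = np.concatenate(outputs, axis=0)   # gather the shard outputs back
assert np.allclose(result, batch.data @ W)
```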
A
Also,
a
tens
torrent
is
a
company
I'll
talk
more
about
them
later
and
there
have
an
express
goal
for
automatic
process,
starting
from
just
a
pi
torch
api
and
then
distributing
things
all
the
way
across
infrastructure
of
whatever
you
have
available.
A
So
I
think
there
people
are
trying
to
lower
the
barrier
to
handling
larger
problems.
One
of
the
reasons
why
is
this
was
a
slide
I
pulled
out
of
one
person's
talk
of
these.
Are
these
transformer-based
guys
that
we're
starting
to
show
an
interest
in,
and
you
can
kind
of
see
that
this
is
exponential
growth
of
this,
so
so
around
in
2021,
we're
going
to
actually
already
start
to
see
trillion,
parameter
models
and
it'll
just
turn
upwards
into.
A
I
mean
it's,
it's
it's
a
it's
a
frightening
curve
if,
in
fact,
it
doesn't
be
over
at
some
point
just
due
to
availability.
So
there's
a
lot
of
talk
about
how
in
the
world
do
we
possibly
handle
this.
This
gargantuan
growth
in
in
desired
capacity
for
these
giant
models.
D
So the y-axis here is the number of parameters?
C
A
One of the other talks, in one of the keynotes, they were talking about Moore's law and how these things are on different exponential lines, you know, capacity versus Moore's law. And one of the quotes out of there that I found amusing was that the number of people who are predicting the death of Moore's law doubles every year.
A
C
D
B
Yeah, what I find interesting about this is that, as you point out, people have been predicting the death of Moore's law for a long, long time, going back to the 80s, and it didn't happen. And why didn't it happen?
B
Well, they just got more and more clever on how to use CMOS, making smaller features on silicon chips and so on, and every time they did that, of course, they actually reduced the power consumed by individual transistors, and things like that. Here we have sort of a different issue, and the question is, say, well, let's assume people are wrong about that, that this can't keep scaling.
B
One possible answer, one possible answer, is sparsity, and I don't know if there are others. But if that's true, that says sparsity will be at the center of all this stuff going forward, that it will be the absolute requirement to continue moving in this direction. It just puts...
B
A
Yeah, well, both of the keynotes basically took different tacks on that. I'm covering one of them, and the other one is available too. So one keynote was from Intel; of course, just as a spoiler, they see a path to a 50x increase in transistor density. And Google took the other approach, and they basically... because...
A
B
That might really be, again, hinged on certain optimizations of the sparsity, not just that they can make the transistors smaller.
A
There's still a power problem, but if you can stack them indefinitely, if you can interpose cooling layers in between all of those things, then in that sense you could keep on going.
A
Well,
your
power
dissipation
is
a
function
of
surface
area,
and,
if
you
can,
you
can
basically,
instead
of
being
at
a
solid
block
of
silicon,
if
you
can
break
it
up
and
increase.
B
It
I
I
understand
that
it's
yeah
well,
I
I
interesting
okay,
so
that
was
one
three
dimensions
with
cooling
right.
What's
that.
A
They're,
basically,
there
are
ways
of
stacking
transistors
as
wires
and
stacking
them
vertically
in
place
kind
of,
like
you
know,
taking
finfet,
but
going
and
and
stacking
in
that
the.
A
It
does,
but
since
they're
extended
structures,
they
can
actually
probably
force
either
air
or
liquid
through
there
to
do
that.
The
other
thing
is
is
kind
of
the
chiplet
idea
and
they're
they're.
Basically
looking
at
I
mean
it
only
gets
you
to
50x.
I
mean,
if
you
think
about
that,
that's
that
that's
that's
only
you
know
five
doublings
or
or
so
so
it's
it
it.
It
kind
of
runs
out
of
steam.
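As a quick check on that "five doublings" figure, my own arithmetic rather than the keynote's:

```python
import math
print(math.log2(50))  # ~= 5.64, so a 50x density gain is about five to six doublings
```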
A
At
some
point,
google's
approach,
on
the
other
hand,
was
that
what
happens
when
you
run
out
of
out
of
room
at
the
bottom
to
kind
of
use,
fame,
feynman's
kind
of
quote
and
they're
looking
and
see
how
they
can
go
in
the
other
direction.
You
know
to
to
to
vast
arrays
of
things
and
then
being
smart
about
that
yeah.
B
But
that's
still,
that's
still
your
your
accumulative
power
is
it's
still
going
to
go
up.
You
know.
A
It
it
is
it's
just
that
it's
it
is,
the
power
will
go
up,
but
it
it's
it's.
Not
it's
not
going
to
be
a
nuclear
core.
B
D
So, Jeff, yeah, I have another answer to your question. You mentioned sparsity as one way that'll help give... maybe.
B
B
D
The reason is that if you have the Thousand Brains Theory kind of idea, where each cortical column builds a model of its world, builds a predictive model of the world, I think that can be much, much more compact than a system that's simply storing every possible combination of things. Yeah, absolutely right. In fact, that's going to give... I don't know how many orders of magnitude, yeah.
B
That's
that's
that
we
we
might
be
right,
but
that
might
be
you're
right
and
I
I
guess
I
was
thinking
like.
Oh
these
people
are
building
these
traditional,
deep
learning
models.
You
know
we're
going
to
take.
You
know
these
language
models
and
just
make
them
bigger
and
bigger
and
bigger
type
of
things
and
but
you're
right.
That's
a
different
breaking
point.
That's
like
yeah,
throw
out
all
those
models
and
build
them.
This.
D
B
And something... go ahead. Given those models, how are they going to scale that, that you could scale for some reason, or sparsely. But you're right, ultimately you might have to abandon these kinds of models. And you're right, of course, the brain is a super monstrous model and it doesn't take much power at all.
A
We've talked to Rain Neuromorphics, which has that very wide, sparse thing. There are also, and I'm not going to cover them, photonic solutions that manage to do some amazing things, one of which is that if you basically have coherent light hitting something like a diffusion screen, a diffusive medium, that basically is kind of your random sampler right out there.
B
Yeah
yeah
yeah,
so
well.
Let
me
actually
see
how
this
plays
out.
I
mean,
to
one
hand
the
extent
that
people
can
continue
doing
this
stuff
just
by
doing
clever
pieces
of
hardware
or
the
optics,
or
you
know
that
that's
not
great
for
us
the
extent
that
they
really
hit
some
that
the
way
they're
going
to
keep
going
is
based
on
sparsity
and
a
thousand
brain
theory,
and
so
that's
great
for
us.
B
A
So there's a natural limit. I mean, we hit 300-millimeter-diameter wafers how many years ago, and they haven't pushed beyond that point, because to build a fab for that is on the order of like three billion dollars. Yeah.
B
A
I know, I understand that. So the next ones are more gee-whiz slides, so let me just go on to there. So this was the first keynote; he lays out the scope of the exponential challenges: the computation needs, the power, the fact that we're generating a huge amount of data, and whether we want to recycle that back into machine learning is an interesting thing. So he lays out this roadmap; I don't show the roadmap here.
A
You
have
to
look
kind
of
look
at
the
slides
because
it's
it's
it's
probably
about
you-
know
12
15,
slides
to
show
that
road
map,
so
the
no
transistor
left
behind
was
actually
quote
from
david
blyth,
who
also
spoke
at
the
conference.
So
he
just
wanted
to
give.
You
know
credit
to
where
that
came
from.
A
That's
the
idea
that
you
you
try
to
make
each
transistor
do
something
useful
all
the
time
if
possible,
you
know
so
that
you,
you
don't
waste
resources,
so
he
went
through
and
there's
a
series
of
slides,
but
this
was
the
culmination
of
basically
various
levels
of
technology
nodes
where
you
digitize
everything.
You
network
everything.
Everything
gets
onto
mobile
everything's
in
the
cloud
and
then
you
get
to
exit
scale
where
you
have
100
billion
intelligent
connected
devices.
You
know
and
the
amount
of
compute
that's
associated
with
that.
So
that's
that's
a
g
with
slide.
A
B
I think the intelligence there is not referring to the kind of intelligence we talk about with each other, it's not... or is he talking about AI? I mean true AI, or is he just saying, well...
A
They're
trying
to
push
smart
stuff
to
the
edge,
which
I
think
that's
what
he's
kind
of
talking
about,
but
I
mean
he
has
some
slides
where
on
on
a
a
graph
where
he
showed
you
know
where
we're
not
anywhere
close
to
reaching,
and
it
was
logarithmic
in
both
dimensions
where
it
was
human
intelligence
and
super
intelligence
he's
just
going
on.
You
know
synapse
numbers
and
stuff
like
that,
but
I
I
figured
that
was.
A
...that was not a really well-qualified slide. But here's the chart: if you didn't know what Moore's law was, the idea was that basically every two years you can pack that much more compute power onto the same area on a chip, and that's been driving technology, at least in the digital realm, for quite a while. But now with these models, rather than a two-year doubling, we've gone to a 3.4-month doubling of required capacity, and also the data is exploding.
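To put those two doubling rates side by side, here is a quick back-of-the-envelope calculation, my own arithmetic rather than a figure from the slides:

```python
# Compare Moore's-law-style doubling every 2 years with the reported
# 3.4-month doubling in compute demanded by large ML models.
moore_growth_per_year = 2 ** (12 / 24)    # ~1.41x per year
ml_growth_per_year    = 2 ** (12 / 3.4)   # ~11.6x per year

print(f"Moore's law:  {moore_growth_per_year:.2f}x per year")
print(f"ML compute:   {ml_growth_per_year:.1f}x per year")
print(f"Over 5 years: {moore_growth_per_year**5:.1f}x vs {ml_growth_per_year**5:.0f}x")
```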
A
The data that we're generating, and that we kind of go out in the world to get to train these things, is getting up to, by 2025, something like 175 zettabytes; that's 10 to the 21st bytes, if I'm right. Anyway, a big number. But his claim was there's still plenty of room at the bottom, and that was the 50x. So, a couple of papers I just want to go through quickly. This one was kind of interesting for a couple of reasons.
A
So basically the claim is that Bayesian inference is not easy to do on conventional processors, so they built their own.
A
The idea is that if you're trying to do something predictive using Bayes' rule and Bayesian inference, rather than having a point estimate, you're estimating a distribution, and in cases where you have incomplete data, this is, given the available information you have, your best guess at doing stuff. So they claim to be the first silicon accelerator for Bayesian inference.
A
They
did
a
hardware
algorithm
co-design
with
parallel.
Is
it
something
something
on
to
carla
always
markov
chain,
and
so
they
applied
it
to
some
unsupervised
tasks,
and
so
this
is
what
it
is
now.
Here's
the
thing
that
I
find
interesting:
they
use
chip
kit
they
actually
they
have
access
to
this
technology
where
they
have
an
incredibly
short
design
cycle
for
something
this
is
from
rtl
to
tape
out
in
three
months
by
five
people.
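For context, here is a minimal sketch of the kind of computation a Markov chain Monte Carlo accelerator speeds up: drawing samples from a posterior distribution rather than producing a single point estimate. The toy posterior and the sampler below are my own illustration, not the paper's design.

```python
# Minimal Metropolis sampler over an assumed toy posterior (standard normal),
# just to make the inner loop of MCMC concrete.
import math, random

def log_post(theta):
    return -0.5 * theta * theta   # log of an (unnormalized) standard normal

def metropolis(n_samples, step=0.5):
    theta, samples = 0.0, []
    for _ in range(n_samples):
        proposal = theta + random.gauss(0.0, step)
        log_alpha = log_post(proposal) - log_post(theta)
        # Accept with probability min(1, p(proposal)/p(theta)).
        if log_alpha >= 0 or random.random() < math.exp(log_alpha):
            theta = proposal
        samples.append(theta)
    return samples

draws = metropolis(10_000)
print(sum(draws) / len(draws))    # posterior mean estimate, should be near 0
```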
A
I mean, that's incredible for something of this magnitude, and if that technology is really available, and I think it actually is, it's going to eat into the FPGA market. I mean, look at how long it sometimes takes to bring up an FPGA with all the algorithms you want, and part of that is you're trying to work within the constraints of the FPGA chip.
A
But
if
you
just
use
standard
library
cells
and
do
place
and
route
put
things
exactly
where
you
want
them
in
exactly
the
right
type,
you
have
a
more
optimal
design.
I
mean
it
will
not
be
as
as
it
won't
be
necessarily
reprogrammable,
but
at
least
it
allows
you
to
try
things
out
with
relatively
low
overhead.
A
You
know
in
terms
of
resources,
so
this
is
just
a
quick
picture
of
where
they
started
with
the
input
where
they've
knocked
out
some
data
or
made
it
credited
up
and
then,
after
some
number
of
iterations
here's
how
much
it
tries
it
out
this
one.
You
know
you
know
it
didn't,
have
much
to
really
pull
this
together.
So
I
don't
think
that
against
the
baseline,
it
was
doing
you
know
a
hell
of
a
good
job
for
some
of
these
other
ones.
A
Okay,
tens
torrent,
so
they
had
a
talk,
neurons
versus
nand
gates
versus
networks
to
find
the
right,
compute
substrate,
artificial
intelligence.
A
They
have,
I
think,
a
complementary
or
not
complimentary,
but
a
similar
philosophy
to
to
the
manta
in
some
ways
except
they're
way,
more
hardware
oriented,
so
they
were
found
in
2016.
A
They
got
70
people,
they
have
people
with
a
spectrum
of
backgrounds
in
in
architectures,
their
ml
inference
and
training,
training,
they're
looking
at
anything
from
edge
to
data
center,
and
they
want
general
purpose
high,
throughput,
parallel
computation,
so
here's
their
particular
talk
of
where
we
need
to
be
where
ml
compute
demand
is-
and
this
is
this-
is
the
the
moore's
law
for
clusters
and
then
for
mega
clusters,
which
would
be
like
warehouse
level
and
where
we
have
to
go
and
where
moore's
law
is
going
to
take
us.
A
So
you
know
stating
that
there's
a
there's,
a
problem
in
this
match,
so
their
goal,
I
mean
ambitious,
is
largest
clusters
ever
so
they
want
networking
computing
to
chip
and
they
basically
want
one
device
in
pipe.
In
other
words,
they
want
to
take
pie,
torch,
define
your
model
in
pie,
torch
and
then
have
it
scale
out
seamlessly
across
this
level
of
hierarchy.
D
A
These are, in some ways, like a Cerebras in the sense that each of these is an individual processor. You have memory on the side, you have I/O coming in here, and this is, I guess, a connection network. Well, that's...
A
Yeah, well, so here's how they kind of map things out, where they basically take these guys into groups to define various stages. So if you remember the model...
A
Yes, I mean, they're thinking ahead to how you would scale it out, and so they basically say, how do we map pipeline parallelism and model parallelism onto these array processors, similar in that sense to Cerebras.
A
So Jawbridge was the first prototype, Grayskull is their current one, and Wormhole is the next one that they're going for. The noted fact here is this one's got 16 ports of 100-gig Ethernet; this is designed to go into a network cluster with an integrated network switch, so they're basically proving their technology one step at a time. Why did you say this is...
A
B
A
Okay, that's interesting. So this dynamic execution is where they can actually have this control flow; here they do sparse compute; here they have dynamic precision, where you use the precision you need; and there's runtime compression of weights and activations, which is very similar to what we're thinking about.
A
B
A
So they're showing the matrix multiplication, showing the results coming out of it as sparse. But what was weird, and I went on chat with these guys because you can do that on these things, is that they had the activations sparse, the weights were dense, and the results were sparse. Well, okay, but they were claiming a 100x max boost; they weren't taking advantage of weight sparsity.
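Here is a rough sketch of where a boost like that can come from when only the activations are sparse and the weights stay dense. This is my own NumPy illustration, not Tenstorrent's implementation, and it assumes 99% of the activations are zero.

```python
# Rows of the weight matrix whose corresponding activation is zero can simply
# be skipped, so the work scales with the number of nonzero activations.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
x[rng.random(1024) < 0.99] = 0.0          # assume ~99% of activations are zero
W = rng.standard_normal((1024, 256))      # dense weight matrix

nz = np.nonzero(x)[0]                     # only ~1% of rows participate
y_sparse = x[nz] @ W[nz, :]               # work proportional to the nonzeros
assert np.allclose(y_sparse, x @ W)
print(f"multiplies needed: {len(nz) * W.shape[1]} vs dense {x.size * W.shape[1]}")
```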
A
Somehow he had this notion that if you had a cascaded set of these sparse results, somehow you'd want to multiply them together, and I asked what the use case for that was, and I didn't really get a good answer. David Kanter, from Google, agreed with me; he said, I thought he just meant this to be sparse as well, so sparse times sparse gets you a new sparse result.
A
B
A
You don't believe that? Well, he'd say only in the use case of transformers, where you're multiplying activations together, as opposed to weights.
D
B
D
They're not, no. So here they can only do dense weight matrices, and the bulk of the compute is in multiplying against the weights, so you would not get the 100x there. But part...
B
B
A
Yeah, actually I've only got three more slides after this. So they claim they have full PyTorch integration; they're saying which flows they can take in. They basically use ONNX as the lingua franca in order to feed the other flows into their stack.
A
So basically, the notion is, when they say "a device," I'm assuming they're talking about that in PyTorch language, and it says basically, we can map it out no matter what the size of the computer is, but I'm presuming they mean their computer; and it says pre-trained models can benefit from the Tenstorrent features. So that's the automatic deployment flow concept.
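For reference, here is a minimal sketch of the standard PyTorch-to-ONNX handoff that this kind of flow starts from; the Tenstorrent-side import isn't public in these slides, so only the generic export step is shown.

```python
# Export a small PyTorch model to ONNX; the resulting file could then be fed
# to a vendor toolchain that consumes ONNX as its interchange format.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

dummy_input = torch.randn(1, 128)   # example input fixes the tensor shapes
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)
```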
D
It's nice to see them be native PyTorch, yeah, completely, because most of this stuff starts with TensorFlow, and then PyTorch is an afterthought.
A
Well,
I
mean
that's,
that's
the
feat
of
the
model,
but
the,
but
they
also
take
these
other
flows
as
well.
So
yeah.
D
But
onyx
is
very
restrictive,
so
if
we
anything
that
has
to
go
through
onyx,
it's
hard
to
do
something
really
novel
there:
okay,
whereas
if
it's
just
if
you
look
at
just
the
left,
pi
torch
arrow,
you
know
with
full
support
for
conditionals
and
torch
script
and
so
on.
That's
that's
pretty
good.
A
Okay,
so
here's
what
they
claim
to
have
the
capability
on.
They
have
models
ready
for
these
guys.
Here's
where
they're
they're
imagining
this
stuff
is
gonna,
be
applicable.
A
They
used
some
examples
I
think
in
their
in
the
other,
slides
for
vg
vgg
vgg,
yes,
resnet50,
there's
others
looking
at
you
know
deployment
in
those
areas,
but
that
of
course
requires
pretty
much.
You
know
other
engagements
I
would
guess
so
they
say
their
public
data
is
on
our
development
cloud
on
november
1st,
and
so
they
currently
are
doing
evaluations
of
their
of
their
product.
A
So
65
watts
is
what
their
grade
skull
runs
at
and
here's
their
bur
inference
performance.
A
Now
they
basically
are
looking
at
these
notion
of
conditionals,
with
light
conditional
execution,
mixed
precision,
moderate
conditional
execution
that
that
that
dynamic
computation
they
mentioned
over
there
where
they
can,
I
guess
they
can
switch
between
various
blocks.
I
think
that's
where
they're
talking
about
the
conditional
execution
and
when
they
can
do
that,
apparently
they
you
know,
boost
their
performance
by
a
considerable
amount,
but
even
from
the
slides
that
I
saw
they
they
didn't
really.
It
has
to
be
in
the
talk,
but
from
the
slides.
A
They
didn't
mention
a
lot
about
what
they're
meaning
about
conditional
execution.
But
it's
something
to
look
at,
because
that
crops
up
a
lot
places
you
have
predicated
execution
or
conditional
execution
where
the
cost
of
of
taking
a
branch
or
going
one
way
versus
another
can
be
expensive
in
a
lot
of
architectures.
A
Yeah, that would probably be my guess, yeah. I was looking at that and saying, okay, so it's like I said. Obviously I'm just pulling selected things from their slide deck, and these are just a fraction of the slides, but just to kind of drive the story forward.
A
So
anyway,
I
will
post
this
up.
Someplace
and
super
typically
give
me
some
place
where
I
can
upload
the
all
the
pdfs
I
can.
I
can
do
that
as
well.
Okay,
people
can
look
at
it.