From YouTube: How to Evaluate Efficient Deep Neural Network Approaches

Description

Enabling the efficient processing of deep neural networks (DNNs) has become increasingly important for deploying DNNs on a wide range of platforms and for a wide range of applications. To address this need, there has been a significant amount of work in recent years on designing DNN accelerators and developing approaches for efficient DNN processing, spanning the computer vision, machine learning, and hardware/systems architecture communities. Given the volume of work, it would not be feasible to cover it all in a single talk. Instead, this talk will focus on *how* to evaluate these different approaches, which include the design of DNN accelerators and DNN models. It will also highlight the key metrics that should be measured and compared, and present tools that can assist in the evaluation.
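
For a rough sense of what "measuring and comparing key metrics" can look like in practice, here is a minimal Python sketch (not from the talk; all names, numbers, and design points below are hypothetical) that reports accuracy, throughput, energy per inference, and TOPS/W side by side rather than ranking designs on any single number:

```python
# Illustrative sketch only: combining a few headline metrics for a
# DNN accelerator + model pair. All values and field names are placeholders.
from dataclasses import dataclass

@dataclass
class EvalResult:
    accuracy: float      # task accuracy (e.g., top-1 on the target dataset)
    latency_ms: float    # measured end-to-end latency per inference
    energy_mj: float     # measured energy per inference (millijoules)
    macs: int            # multiply-accumulate operations per inference

    @property
    def inferences_per_second(self) -> float:
        return 1000.0 / self.latency_ms

    @property
    def tops(self) -> float:
        # Effective throughput in tera-ops (2 ops per MAC), based on measured latency.
        return 2 * self.macs / (self.latency_ms * 1e-3) / 1e12

    @property
    def tops_per_watt(self) -> float:
        # Efficiency metric; on its own it hides accuracy and utilization.
        watts = (self.energy_mj * 1e-3) / (self.latency_ms * 1e-3)
        return self.tops / watts

# Two hypothetical design points: report all metrics together.
a = EvalResult(accuracy=0.76, latency_ms=8.0, energy_mj=12.0, macs=600_000_000)
b = EvalResult(accuracy=0.71, latency_ms=3.0, energy_mj=4.0, macs=150_000_000)
for name, r in (("design A", a), ("design B", b)):
    print(f"{name}: acc={r.accuracy:.2f}, {r.inferences_per_second:.0f} inf/s, "
          f"{r.energy_mj:.1f} mJ/inf, {r.tops_per_watt:.2f} TOPS/W")
```

The point of reporting the metrics jointly is that a design can look strong on one axis (e.g., TOPS/W) while giving up accuracy or throughput on another.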

Slides for the talk are available at https://www.rle.mit.edu/eems/publications/tutorials/

Related article available at https://www.rle.mit.edu/eems/wp-content/uploads/2020/09/ieee_mssc_summer2020.pdf

If you would like to learn more, please check out our recently published book, "Efficient Processing of Deep Neural Networks," at https://tinyurl.com/EfficientDNNBook

Excerpts are available at http://eyeriss.mit.edu/tutorial.html

We also hold a two-day MIT Professional Education Short Course on "Designing Efficient Deep Learning Systems". Find out more at http://shortprograms.mit.edu/dls
------------
References cited in this talk
------------
* Limitations of Existing Efficient DNN Approaches
- Y.-H. Chen*, T.-J. Yang*, J. Emer, V. Sze, “Understanding the Limitations of Existing Energy-Efficient Design Approaches for Deep Neural Networks,” SysML Conference, February 2018.
- V. Sze, Y.-H. Chen, T.-J. Yang, J. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295-2329, December 2017.
- Hardware Architecture for Deep Neural Networks: http://eyeriss.mit.edu/tutorial.html

* Co-Design of Algorithms and Hardware for Deep Neural Networks
- T.-J. Yang, Y.-H. Chen, V. Sze, “Designing Energy-Efficient Convolutional Neural Networks using Energy-Aware Pruning,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
Energy estimation tool: http://eyeriss.mit.edu/energy.html
- T.-J. Yang, A. Howard, B. Chen, X. Zhang, A. Go, V. Sze, H. Adam, “NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications,” European Conference on Computer Vision (ECCV), 2018. http://netadapt.mit.edu/

* Processing In Memory
- T.-J. Yang, V. Sze, “Design Considerations for Efficient Deep Neural Networks on Processing-in-Memory Accelerators,” IEEE International Electron Devices Meeting (IEDM), Invited Paper, December 2019. http://www.rle.mit.edu/eems/wp-content/uploads/2019/12/2019_iedm_pim.pdf

* Energy-Efficient Hardware for Deep Neural Networks
Project website: http://eyeriss.mit.edu
- Y.-H. Chen, T. Krishna, J. Emer, V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” IEEE Journal of Solid-State Circuits (JSSC), ISSCC Special Issue, vol. 52, no. 1, pp. 127-138, January 2017.
- Y.-H. Chen, J. Emer, V. Sze, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks,” International Symposium on Computer Architecture (ISCA), pp. 367-379, June 2016.
- Y.-H. Chen, T.-J. Yang, J. Emer, V. Sze, “Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems (JETCAS), June 2019.
- Eyexam: https://arxiv.org/abs/1807.07928

* DNN Processor Evaluation Tools
- Wu et al., “Accelergy: An Architecture-Level Energy Estimation Methodology for Accelerator Designs,” ICCAD 2019, http://accelergy.mit.edu
- Wu et al., “An Architecture-Level Energy and Area Estimator for Processing-In-Memory Accelerator Designs,” ISPASS 2020, http://accelergy.mit.edu
- Parashar et al., “Timeloop: A Systematic Approach to DNN Accelerator Evaluation,” ISPASS 2019
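
As a loose illustration of the idea behind architecture-level energy estimation (in the spirit of tools like Accelergy and Timeloop, but not using their actual file formats or APIs), here is a minimal Python sketch that weights hypothetical per-action energy costs by hypothetical action counts:

```python
# Hedged sketch: total energy ~= sum over components of
# (action count) x (energy per action).
# All component names and values below are made-up placeholders.

# Hypothetical per-action energy costs in picojoules.
energy_per_action_pj = {
    "dram_read": 200.0,
    "global_buffer_read": 6.0,
    "register_file_read": 1.0,
    "mac": 0.5,
}

# Hypothetical action counts, e.g., from a mapping/dataflow simulation of one layer.
action_counts = {
    "dram_read": 1_000_000,
    "global_buffer_read": 20_000_000,
    "register_file_read": 100_000_000,
    "mac": 100_000_000,
}

total_pj = sum(energy_per_action_pj[a] * n for a, n in action_counts.items())
print(f"estimated layer energy: {total_pj / 1e6:.2f} microjoules")
```

Even this toy breakdown shows why data movement, not just MAC count, tends to dominate the energy of DNN processing.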