Co-Design of Deep Neural Nets and Neural Net Accelerators for Embedded Vision Applications

Captions
Okay, so what I'm going to present today is a collaboration with Kiseok Kwon of Samsung. Kiseok is an integrated circuit designer at Samsung; he and his research group had seen my work on SqueezeNet and began to look at the potential for building accelerators particularly tuned to small nets, to run on small processors for embedded applications ranging from autonomous vehicles to mobile phones. At the core, this is really Kiseok's work, together with my other colleagues at Berkeley.

I don't think we have to spend a lot of time on motivation; we're all here because of the motivation. But my particular angle is that we will finally see the dream of ubiquitous computing and ubiquitous intelligence realized, and the particular medium for realizing it is small, energy-efficient deep neural nets. These neural nets offer the best solutions for many of the problems we would like to solve in bringing intelligence into our automobiles, our factory floors, our offices, even our kitchens. It will probably also bring intelligence into places we don't want it, as we are seeing in the broad work on surveillance around the world.

As we do that, and here is where I like to poke fun at my friends at this conference, the core computer vision types often don't report run times; power is something they see on their electricity bill, and that's about as much as they know about it. They live up at the top of this axis, on devices in the hundreds of watts, or something like a DGX-1, and don't think twice about it. But if we want to bring that intelligence into all these different places, we have to move down this axis. This chart is anatomically correct, priced out with hours of graduate-student labor, looking at devices from a laptop down to a pedometer or an Apple Watch Series 3. The first figure of merit is peak power dissipation. That matters because of packaging and human-factors costs: you don't want your phone to heat up too much, partly because of the cost of dissipating that power and partly because it's just uncomfortable to hold something that warm against your ear. The second figure of merit is battery life, which is related to energy, the integral of power over time. We all know how that feels; I carry a charger in my pocket because when I'm using GPS my battery is likely to run out. So this long motivation says, first of all, that these are the regimes we want to operate in, which are much lower than the ones we have been operating in, and we should be very concerned about power and energy.

To put that another way: here is what is in the trunk of these experimental urban taxis. It may in fact be a DGX-1 or its computational equivalent, and it is literally three kilowatts. That's not a number I pulled out of my head; Toyota Research Institute has said their experimental cars pack three kilowatts in the trunk. Even the most optimistic plans for deploying something like an urban taxi are in the hundreds-of-watts regime. But we want to be able to supply this kind of computer vision technology in simple L1 through L3 driver-assistance systems for ordinary passenger cars, which means we need to get into the 10 to 30 watt range.
Although I did hear the CEO of one of the leading companies in low-level vehicle autonomy talk about packing sixty watts into their accelerator, so maybe 60, but as was mentioned, that will require water cooling. Ten to twenty watts will definitely get you something you can operate without a fan. And finally, although this talk is very much focused on autonomous vehicles, we would like to get down more broadly to sensors and wearables, all the way to the milliwatt regime. So the question for us as researchers is: what can we do with 10x less computational speed and 100x less power dissipation?

The good news is that the problems are familiar; they are problems we have worked on for a long time. All of that intelligence everywhere that I've been talking about comes down to familiar problems: image classification, object detection, and semantic segmentation. The problems aren't changing; only the computational requirements are. Similarly, the basic solution we will employ, convolutional neural nets or something in that family, is also not changing. So there is a lot in common here, but some things are changing.

What I want to do is look at the most aggressive of those applications, object detection, which is the most computationally demanding of the problems I showed you. I have to thank the MobileNet folks; the MobileNetV2 paper did a really nice job of pricing this out, obviously for purposes of comparison against their latest version of MobileNet. What they show is mean average precision on COCO, on these really small images, along with parameters and operations per image. I priced it out further: almost everything we're thinking about uses a video camera, and 30 frames per second is pretty conservative; most cameras today go to 60 frames per second. Looking at this, particularly with the more energy-efficient and fast nets, it's not unreasonable even on contemporary platforms. But that's today, for object detection on COCO-sized images, which is where a lot of our conference lives. A good old 1080p camera is about 12 times bigger than a COCO image, and if we want to go to Ultra HD, which is where the automotive tier ones tell me they want to be shortly, we have to go roughly 48x bigger. That's the injury; the insult is that these are superlinear algorithms. So even though the computation didn't look too bad a minute ago, if we really want the resolutions we would like, we are going to need a lot more computational power, and that is why we need accelerators.
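As a rough back-of-the-envelope sketch of that scaling argument (the baseline cost and the superlinearity exponent below are illustrative assumptions, not numbers from the slides):

```python
# Back-of-the-envelope sketch of how detector compute scales with input
# resolution and frame rate. The baseline numbers are illustrative stand-ins,
# not figures from the talk: assume an SSD-style detector that costs about
# 0.8 GMACs on a 300x300 COCO-sized input.

BASE_RES = (300, 300)          # baseline detector input (illustrative)
BASE_GMACS = 0.8               # per-frame cost at that resolution (illustrative)

def per_frame_gmacs(width, height, exponent=1.0):
    """Scale cost with pixel count; exponent > 1.0 models superlinear growth
    (e.g. more anchors and proposals at higher resolution)."""
    scale = (width * height) / (BASE_RES[0] * BASE_RES[1])
    return BASE_GMACS * scale ** exponent

for name, (w, h) in [("300x300", (300, 300)),
                     ("1080p", (1920, 1080)),
                     ("4K UHD", (3840, 2160))]:
    for fps in (30, 60):
        gmacs = per_frame_gmacs(w, h, exponent=1.1)  # mildly superlinear
        print(f"{name:8s} @ {fps} fps: {gmacs:7.1f} GMACs/frame, "
              f"{gmacs * fps / 1000:6.2f} TMACs/s sustained")
```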
phone and this processor no problem however if you want to do object detection at speed on a 4k 4k video doing 30 to 60 frames per second probably you're going to need an accelerator in any case just in terms of the potential um here are three at 2018 is sec papers and what we see here are ranging from 250 to 10,000 X improvement and the millions of operations that we can do in a single middle of what right so so you can decide whether I think as Samsung did like well this is too good an opportunity to pass up we need to get started Ori or if you're trying to do something very computationally aggressive then you absolutely need something like this so let's take a look at the basic architecture of these at the very core they pretty much look the same we have a mesh of processing elements right there at the core and and I think that the native term for this is spatial parallelism I grew up in calling this after HD cone systolic parallelism we're just going to pump data through these individual elements the processing element on many of these accelerators couldn't be any simpler it's the kind of thing you do in an undergraduate project you just have a multiple humility you know some multiplexers to data and access to a register file one thing that's going to be very important is to throughout this talk has to understand the access to the DRAM so this is sitting off chip and so in the crowd that I usually hang out with you know this is this kind of intuition is in their bones I have no idea about this audience but just like this is what this is a long trip as I say if you've got to go out there to DRAM to get data you're making a long trip okay okay Sano that's the poor or basic accelerator architecture now I've given trials of the stark and seeing complete confusion got a lot of questions and and oh I see we really need to talk about different assumptions around accelerators for most people the only neural net accelerator they know is the TPU and that's for originally for efference and now training and inference in the cloud that runs on tens of thousands of processing elements and that systolic curls and imma show you it's you know goes in the tens of watts machine your batch sizes are as big as as you can get accuracy around say in training so it could be tens to hundreds or you have what will be important to see later as you have hi or perhaps even complete interconnectivity among the processing elements and you're operating in a floating-point regime okay so when ki saw Kwan came he had a very different objective he's you know looking at a single IP block to embed with other blocks on say a multiprocessor for a mobile phone okay so we're talking eight to sixteen pease we're in single-digit millimeter squared we've got milliwatts not watts to work with very important in terms of the operation the algorithms we're talking about a batch size of one what does that mean we're going to process one image at a time whether it's style transfer or whether we're just trying to process images from a camera and autonomous vehicle as fast as possible we're going to be processing them one time we're not going to be able to batch them up okay we're going to be hampered by local mesh connectivity and that means we can only kind of bring things into the edge of this accelerator and then processors have to pass them to their neighbors that's a lot different from we can type them in to any processor or front or go from any processor that we want so kind of these are two very different extremes and if you 
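To make the mesh-of-PEs idea concrete, here is a toy, cycle-by-cycle software model of an output-stationary systolic array doing a small matrix multiply, with data entering only at the array edges and moving between neighboring PEs. It is a sketch of the general idea, not a model of any particular chip discussed here.

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy cycle-by-cycle model of an output-stationary systolic array:
    each PE (i, j) keeps its partial sum C[i, j] in place, receives an A value
    from its left neighbour and a B value from its upper neighbour, multiplies
    and accumulates, then passes the A value right and the B value down."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    # a_reg[i][j] holds the A value currently sitting in PE (i, j); same for B.
    a_reg = [[0.0] * m for _ in range(n)]
    b_reg = [[0.0] * m for _ in range(n)]
    cycles = k + n + m  # enough cycles to drain the array
    for t in range(cycles):
        # sweep bottom-right to top-left so each PE reads its neighbours'
        # values from the previous cycle
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                # edge PEs take skewed inputs from outside; interior PEs take
                # whatever their neighbour latched last cycle
                a_in = a_reg[i][j - 1] if j > 0 else (A[i, t - i] if 0 <= t - i < k else 0.0)
                b_in = b_reg[i - 1][j] if i > 0 else (B[t - j, j] if 0 <= t - j < k else 0.0)
                C[i, j] += a_in * b_in   # the MAC: the only arithmetic in a PE
                a_reg[i][j] = a_in       # pass A right next cycle
                b_reg[i][j] = b_in       # pass B down next cycle
    return C

A = np.arange(6, dtype=float).reshape(2, 3)
B = np.arange(12, dtype=float).reshape(3, 4)
print(np.allclose(systolic_matmul(A, B), A @ B))  # True
```

The point of the sketch is only that every PE talks exclusively to its neighbors; all the scheduling cleverness is in how the operands are skewed as they enter at the edges.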
If you keep the wrong one of those two pictures in your head while I spend the rest of the talk on the other, it's going to be very confusing, so remember: we're talking about modest numbers of processing elements here. I did want to note that there is something in between, and that's when you're building an embedded processor in the tens-of-watts regime, like the accelerator for autonomous vehicles that I thought I heard softly announced this week. There you have on the order of thousands of PEs, so it has slightly different characteristics again, and you'll probably use some sort of mixed connectivity among the processing elements.

Okay, say I'm sold; I actually want to get into the business of designing one of these accelerators. What do I do? Well, here is what the circuit-design groups wrap themselves around. One quick way to get an accelerator is to reduce the bit precision; you can get factors of 8 to 32 immediately just by reducing precision, but you don't see these papers pay much attention to accuracy, and again, I'm talking about the circuits crowd here. Another is that everyone has heard of AlexNet, and in its defense AlexNet has been very widely benchmarked, so there's a natural motivation to evaluate against it. But if you use what is by now a big, bloated net as your only evaluation workload for your accelerator, then, since this is workload-driven design, you're guiding your design in the wrong direction. A third thing you'll see pervasively is evaluation on MNIST or CIFAR, and I think this crowd of computer vision researchers certainly knows that, while we may get some early intuitions from working on those, in terms of really guiding either neural net design or accelerator design there's a long way to go. Now, these are real papers at the top conferences, and these are real chips, and having spent a lot of time in the chip-design world I have huge reverence for anybody who can tape out, probe, and test a chip and show any results at all. I don't mean to trash this work in any way; I have tremendous respect, and there are a lot of tired people behind every one of these papers. But in terms of presenting this as the way we should do neural net accelerator design, I can't give it a very happy face; this is not the way to go.

So what's the alternative? The alternative is co-design. Particularly if we're building a specialized accelerator, we need to understand our application very deeply. Your design team should have people who understand the application deeply, people who understand neural net and accelerator architectures deeply, and people who understand not just AlexNet, or even MobileNet for that matter, but the principles of deep neural net design. With all three of those working together, I think we get much better designs.

Starting with the application: Mohammad already set this up. If Peter decides to do a startup and build a dog-identification application for the iPhone, and I think he'd do a dynamite job, then it's really pretty important to keep all those dog breeds straight. On the other hand, if we're building an object detector for an autonomous vehicle, we don't currently distinguish between a hound and a greyhound; they're all just dogs.
So obviously the accuracy you need in distinguishing those dogs is very different. For those of you who saw my talk at the workshop on autonomous driving, I really believe in this, and I liked Mohammad's segment there; it's the first experimental work I'd seen that starts playing with our notion of accuracy and asks how that changes the whole design. That's exactly what you want to do when you build a neural net accelerator, or even a novel neural net, targeted at a particular application. All that said, is that what I'm going to show today? No, because I'm trying to simplify things to facilitate comparison. I'm not going to talk about MNIST or CIFAR, thankfully, but we will be talking about just good old-fashioned classification on ImageNet.

So, focusing on that: if classification on ImageNet is what we've been told to do, and in some sense what we're building an accelerator for, let's look at what it is we're accelerating. We're talking about computer vision, so we're talking about convolutional neural nets, so we're talking about convolution. I don't think I need to spend a lot of time on this; this is just a typical convolution operation, creating the output from among, say, 512 filters. Now, when I look at this, and I've worked in several different fields, but when I ask what my core competence is, I think I'm really a computational scientist; that's where I live. When I look at this I think: this is my dream come true of a computation. This is the kind of computation that, when I was a teenager, I dreamed I would someday be employed to compute. Why is that? In the terminology I've developed around parallel computation, structurally it's pipe-and-filter, the simplest structure we have, the easiest to predict, just simple forward flow. From a software standpoint I can do coarse-grained pipelining, operator chaining, kernel fusion; it's great. The core computation is matrix multiply, and parallelization of linear algebra is the most studied of all computational patterns; nothing has been beaten to death more than linear algebra in general and matrix multiplication in particular, and if we don't want to do anything too clever we can rely on a lot of library support from BLAS and cuBLAS and so forth. Another wonderful thing: we can statically schedule this computation. People in this room take that for granted, but it's not true at most conferences; in most domains you can't even get a good statistical prediction because the results are so data dependent. This thing runs the same schedule every time. So in one way it's a dream come true, in the sense that it's the simplest possible problem; but put another way, that means anything less than peak performance and we have to be pretty accountable for what happened.
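For concreteness, here is that plain convolution loop nest written out as a minimal sketch; the layer sizes at the bottom are arbitrary illustrative choices.

```python
import numpy as np

def conv2d_naive(x, w, stride=1):
    """The plain convolution loop nest being discussed: x is an input feature
    map laid out as (C_in, H, W) and w is a filter bank laid out as
    (C_out, C_in, K, K). No padding, no dilation; just the core MACs."""
    c_in, h, wd = x.shape
    c_out, c_in2, k, _ = w.shape
    assert c_in == c_in2
    h_out = (h - k) // stride + 1
    w_out = (wd - k) // stride + 1
    y = np.zeros((c_out, h_out, w_out))
    for co in range(c_out):                 # e.g. the 512 output filters
        for oy in range(h_out):
            for ox in range(w_out):
                acc = 0.0
                for ci in range(c_in):      # reduce over input channels
                    for ky in range(k):
                        for kx in range(k):
                            acc += w[co, ci, ky, kx] * x[ci, oy*stride+ky, ox*stride+kx]
                y[co, oy, ox] = acc
    return y

# A statically known loop nest like this can be tiled, fused, and scheduled
# entirely at compile time; the schedule never depends on the data.
x = np.random.randn(8, 14, 14)
w = np.random.randn(16, 8, 3, 3)
print(conv2d_naive(x, w).shape)  # (16, 12, 12)
```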
Okay, now this is going to get a little busy, because to make my point I'm going to find myself going back and forth between the computation we're doing and the basic architectural principles. I'll try to flag, for those for whom it's not immediately obvious, when we're talking about the architecture and when we're talking about the algorithm.

So here is the architecture side, and these numbers are real. What we have here is latency, either to on-chip L1, to on-chip L2, or to off-chip DRAM, and the one-liner for the purposes of this talk is that we lose hundreds of cycles any time we have to go off chip. Relative to the actual computation, even going to L2, we are masking the cost of computation with memory access. In other words, we can count MACs if we want to, and it's relevant to do that, but at the end of the day it's the memory accesses, and the access patterns, that determine the computation. Energy is even more dominated by off-chip memory access; you see we actually get a factor of about 10,000 when we have to go off chip.

Back to the computation. Basically we're doing a matrix multiplication: at the very core we take an input times a weight and produce an output. If I had more time I would go through this dialectically, but the key to making this run fast is tiling the computation: where can we find the reuse? Back to the architecture: there are two basic ways to support that reuse. One is to keep the weights resident and run the inputs and outputs past them; that's weight stationary. The other is to keep the partial sums in one place and run the weights and inputs to them; that's output stationary. It really is just that simple, but it's a big decision: reuse is the most critical thing, so how are we going to support reuse in our hardware?

This is weight stationary, which is, as I understand it, what the TPU does, and I'll do a quick animation of it: basically we leave the weights in one place and pass the inputs over them, producing the outputs one at a time. And this is output stationary: we keep the partial sums in place. I should point out that the last version of this I saw was drawn with 16 PEs, and it gets really messy with 16, so here we have just 4 for purposes of illustration. We keep those outputs stationary and accumulate partial sums of inputs times weights until finally we've completed the computation.
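As a minimal software sketch of those two dataflows, here are two loop orderings over the same matrix-multiply-shaped computation; what differs is only which operand is held resident while the others stream past. The tile sizes and shapes are arbitrary illustrative choices.

```python
import numpy as np

# Two loop orderings over the same computation Y = W @ X, meant only to
# illustrate which operand stays resident in the PE array. W is (C_out, C_in),
# X is (C_in, N) where N is the number of output pixels.

def weight_stationary(W, X, tile=4):
    """Keep a tile of weights resident; stream all inputs past it before
    loading the next weight tile."""
    c_out, c_in = W.shape
    _, n = X.shape
    Y = np.zeros((c_out, n))
    for co in range(0, c_out, tile):
        w_tile = W[co:co+tile, :]          # "loaded into the PEs" once
        for j in range(n):                 # inputs stream through
            Y[co:co+tile, j] += w_tile @ X[:, j]
    return Y

def output_stationary(W, X, tile=4):
    """Keep a tile of partial sums resident; stream weights and inputs to it
    until those outputs are finished."""
    c_out, c_in = W.shape
    _, n = X.shape
    Y = np.zeros((c_out, n))
    for j in range(0, n, tile):
        acc = np.zeros((c_out, tile))      # partial sums live in the PEs
        for ci in range(c_in):             # weights and inputs stream through
            acc += np.outer(W[:, ci], X[ci, j:j+tile])
        Y[:, j:j+tile] = acc
    return Y

W, X = np.random.randn(8, 16), np.random.randn(16, 12)
print(np.allclose(weight_stationary(W, X), W @ X),
      np.allclose(output_stationary(W, X), W @ X))  # True True
```

In hardware the decision is which of these loop orders the PE array's registers and wiring natively support, which is exactly the choice being discussed here.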
We often talk about computing convolutions as though there were a single kind of convolutional layer. In fact there are a lot of them. Here are our old familiar spatial convolution loops; here is the increasingly popular pointwise (1x1) convolution; here is the depthwise convolution that the MobileNet people get a lot of mileage out of; and there are channel shuffles and shifts. If we look across nets, and the selection is a little biased, SqueezeNet you're familiar with, we presented ShiftNet this year at CVPR, you'll hear about SqueezeNext later today, and then there are ShuffleNet and MobileNet V1 and V2, and we tabulate which types of convolution appear in which nets, the point is this: if we're creating an accelerator and we think one or another of these families might be a net we want to run on it, then we need to support these different types of convolution fairly broadly.

Okay, now back to architecture and reuse. Our metric for what counts as good reuse is arithmetic intensity: for every byte we pull from memory, how many MAC operations can we do? That's our simple metric. What we see is that these different convolutional block types have very different behavior, and again we can figure this out statically; the algebra is at the bottom. Note that we're not distinguishing here between a register access, L1, L2, or even going off chip; we're just asking, in general, how many bytes we need to pull. Normalizing to a good old-fashioned 3x3 spatial convolution: pointwise doesn't do quite as well, neither does group convolution, but depthwise convolution has a lot of problems; its arithmetic intensity is roughly 50 times worse, which means, normalized, about 50 times more memory accesses. And here are the particulars: this is not outrageous, it's not rigged to make depthwise convolution look bad or something else look good; it's just a vanilla middle layer of a neural net, and you can check the numbers. The main point is that although we think of these convolutional layers as all being "just convolution," there are factors of 50 between them in data reuse.
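Here is that bookkeeping as a small sketch. The layer dimensions are an illustrative middle-layer guess rather than the slide's, so the exact ratio comes out somewhat different, but it lands in the same few-tens ballpark.

```python
# Rough bookkeeping for arithmetic intensity (MACs per value moved), counting
# each input, weight, and output element exactly once, i.e. ignoring which
# level of the memory hierarchy it comes from, as in the talk. The layer
# dimensions below are an illustrative "vanilla middle layer".

H = W = 14
C_IN = C_OUT = 256
K = 3

layers = {
    "3x3 spatial conv": (
        H * W * C_OUT * C_IN * K * K,                        # MACs
        H * W * C_IN + K * K * C_IN * C_OUT + H * W * C_OUT, # values moved
    ),
    "1x1 pointwise conv": (
        H * W * C_OUT * C_IN,
        H * W * C_IN + C_IN * C_OUT + H * W * C_OUT,
    ),
    "3x3 depthwise conv": (
        H * W * C_IN * K * K,                                # one filter per channel
        H * W * C_IN + K * K * C_IN + H * W * C_IN,
    ),
}

base = layers["3x3 spatial conv"][0] / layers["3x3 spatial conv"][1]
for name, (macs, traffic) in layers.items():
    ai = macs / traffic
    print(f"{name:20s}: {ai:7.1f} MACs/value "
          f"({base / ai:5.1f}x the traffic per MAC of 3x3 spatial)")
```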
So now back to the architectures. We're looking at output stationary and weight stationary, and unfortunately, if we want to support these broad convolution types and these different layer configurations, in the interest of time I'll just point out the simple facts. Output stationary does well when you have lots of intermediate output partial sums to compute, but when you have fewer it doesn't do as well. Weight stationary does fine on pointwise convolutions, which is good, because we'll see that's something like 95 percent of the work in some modern nets, but it is very poor on depthwise convolution, which is another popular layer type, so that's not so good. To look at the particulars of the distribution, here is another family of nets we looked at, and the percentage of the total MACs consumed in the different layer types: here is where weight stationary does really badly, here is where output stationary does really well, and what we see is quite a mix of different types.

So the conclusion: if I were in a classroom I would do this dialectically and really enjoy the different answers, but in the interest of time I'll go straight to the point. Why couldn't we support both dataflows? That's exactly what we did in the Squeezelerator: it is a hybrid weight-stationary and output-stationary dataflow. I'll be talking about it next week at the Design Automation Conference, but I'm forecasting it for you today. With enough pomp and circumstance I think I could convince you this was a brilliant observation; maybe right now you're thinking it was pretty obvious, but those are not inconsistent, brilliant can also be obvious, and if I had done it myself I wouldn't be talking it up. Kiseok Kwon is the person who had that light bulb go off in his head.

So how does it do? Here are output-stationary architectures, here is weight stationary, and here is the Squeezelerator. This is total execution time, shown as the relative speedup of one architecture over another, so basically: how well does the gray bar do against the other two? Kiseok's original motivation was to see how well we could do on the SqueezeNet family, and you see this big spike on SqueezeNet 1.0. But he also did a heck of a job on MobileNet, which, for those of you who have tried to make it run really fast, is a very finicky net to get performance out of; 6x there is a really good day.

Now, with all this talk about reuse, I want to jump all the way back to applications and application characteristics. Vivienne Sze is another of the pioneers in this area; she has raised awareness of the architectural concerns in supporting neural nets and of the importance of energy, and she drives me crazy when she puts up this chart. It shows normalized energy consumption, so going this way is worse, and she stands up and says that SqueezeNet consumes more energy than AlexNet, and I twitch every time I see it. Again, I would normally do this dialectically: what's wrong with this diagram? It took my grad student Bichen a while to figure it out. Although it's often unstated in the papers she has published on this, she is assuming a batch size of 44 to 46, which basically means that with AlexNet, all those big hulking weight parameters don't matter so much, because you get to reuse them over and over and over on her architecture, so there's not much penalty. On the other hand, if you look at our results, where higher is worse, at a batch size of one, which is what you have when you're processing one image at a time in a vehicle or doing one style transfer at a time, SqueezeNet lands just where I think it belongs: the most energy-efficient among all these nets.
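The batch-size effect is easy to see with a toy calculation. The parameter counts below are approximate published figures, the byte width is an assumption, and activation traffic is deliberately left out so that only the weight-reuse effect shows.

```python
# Toy illustration of why batch size changes the energy story: weights fetched
# from DRAM can be reused across every image in the batch, so per-image weight
# traffic is roughly (weight bytes / batch size). Parameter counts are
# approximate published figures; activation traffic is intentionally omitted.

NETS = {
    "AlexNet-like": 61e6,       # ~61M parameters
    "SqueezeNet-like": 1.25e6,  # ~1.25M parameters
}
BYTES_PER_WEIGHT = 1            # assume 8-bit weights

for batch in (1, 44):
    print(f"batch size {batch}:")
    for name, params in NETS.items():
        per_image_mb = params * BYTES_PER_WEIGHT / batch / 1e6
        print(f"  {name:16s}: {per_image_mb:6.2f} MB of weight traffic per image")
```

At a batch size in the forties the AlexNet-style weight penalty mostly amortizes away; at batch size one it dominates, which is the whole disagreement.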
In terms of co-design, in my presentation at the Design Automation Conference next week I'll also talk about going the other way: now that we have a lot of insight from the architecture side, let's go back and look at the net. That is how, working with Amir Gholami, we went from SqueezeNet to SqueezeNext; fortunately he has a poster here, and you'll hear a brief presentation on it in just a moment. To summarize: I think we're going to need neural net accelerators to meet the constraints of embedded vision applications. You can debate whether that's a nice-to-have or a must-have; that probably depends on the application.

[Applause] [Audience question, partly inaudible, about on-chip SRAM]

Again, it's very application driven. We have a paper on this: basically, if you're doing LSTM acceleration, then getting the whole model to sit in on-chip SRAM buys you a lot. So if you told me, say at Qualcomm, that you want to recognize speech better than anybody else, then yes, a lot of SRAM and a fairly modest computational structure can do really well on speech recognition. If we're doing convolutional neural nets, we currently still have to crank through a lot of matrix multiplication, and therefore there are trade-offs; it depends on exactly how big an image you want to process, what your latencies are, and so forth. But SRAM, and thank you for bringing up SRAM, is a first-class citizen; the devil is in the details of how big it should be.
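As a rough sketch of that "fit the whole model in SRAM" feasibility check, with layer sizes and the SRAM budget chosen purely for illustration:

```python
# Quick feasibility check of the "keep the whole model in on-chip SRAM" idea
# from the Q&A: does an LSTM layer's weight matrix fit in a given SRAM budget?
# The layer sizes and the 4 MB budget are assumptions for illustration only.

def lstm_weight_bytes(input_size, hidden_size, bits=8):
    # An LSTM layer has 4 gates, each with an (input + hidden) x hidden weight
    # matrix plus a bias vector.
    params = 4 * ((input_size + hidden_size) * hidden_size + hidden_size)
    return params * bits // 8

SRAM_BUDGET_BYTES = 4 * 1024 * 1024   # e.g. a 4 MiB on-chip SRAM (assumed)

for in_sz, hid_sz in [(256, 512), (512, 1024), (1024, 2048)]:
    need = lstm_weight_bytes(in_sz, hid_sz)
    fits = "fits" if need <= SRAM_BUDGET_BYTES else "does NOT fit"
    print(f"LSTM {in_sz}->{hid_sz}: {need / 2**20:5.2f} MiB of 8-bit weights -> {fits}")
```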
Info
Channel: DeepScale AI
Views: 893
Rating: 5 out of 5
Keywords: Autonomous vehicles, AI, deep learning, computer vision, ADAS, neural networks, self-driving
Id: SxbT3ldoo-I
Length: 37min 38sec (2258 seconds)
Published: Tue Aug 14 2018