Design for Highly Flexible and Energy-Efficient Deep Neural Network Accelerators [Yu-Hsin Chen]

Captions
Let's get started. I want to welcome everyone to Yu-Hsin Chen's thesis defense, and I welcome you on behalf of myself and Joel Emer, who co-supervised Yu-Hsin's PhD thesis. Yu-Hsin received his bachelor's degree from National Taiwan University, then got his master's here at MIT, and then joined me for his PhD degree at MIT. He is one of the first students who joined my group when I returned to MIT; I actually checked my emails last night, and he was officially the first student by just one week. The other two also defended last week, so it's a very special week with all of them defending at this time. Yu-Hsin has, I think, one of the first journal papers of our group and also the first chip that was taped out from our group.

There are a couple of things I just want to mention about this PhD thesis. Yu-Hsin worked in the area of hardware design for DNNs. As many of you may know, it's a very fast-moving area, so you need to think really deeply for your work to stay relevant over a five-year PhD. Through his efforts he has been able to make a strong impact in both the computer architecture community and the circuits community: his work on Eyeriss received a Micro Top Picks award, which comes from the computer architecture community, and his JSSC paper is basically the top downloaded paper of 2017 and continues to be so right now. In addition to Eyeriss, Yu-Hsin has made many contributions that help provide some organization and structure to this rapidly growing field, so it allows you to get more insight into the different developments that are going on. One example of this is how we describe dataflows, which is now widely used in many computer architecture descriptions of DNN accelerators.

Thank you so much for the nice introduction, and thank all of you for coming to my talk. Today I'd like to share with you the project I've been working on for the past four years, which is on architecture design for deep neural network accelerators that are highly flexible and energy efficient.

As we know, nowadays deep neural nets are used basically everywhere. They are a cornerstone of modern AI, enabling applications ranging from smart assistants to self-driving cars and even to playing the game of Go. But what we notice is that these new AI applications are bringing new challenges to the underlying hardware systems and infrastructure. For example, we need high processing throughput in applications like self-driving cars to deal with a large amount of data. We also need low latency in smart assistants to have a very smooth conversation. In addition, energy efficiency is becoming more and more important because many devices nowadays run on batteries. And last but not least, we need the hardware to be very flexible, because we don't want to support just one net; we want to support many different nets for many different applications.

Those challenges inspired us to propose a new architecture design for DNN acceleration, which we call Eyeriss. Eyeriss is optimized for the three pillars of the architecture: performance, energy efficiency, and flexibility. Specifically, we have designed two versions of Eyeriss. The first version, Eyeriss v1, was designed at a time when DNNs were very large and involved a lot of computation; what we found was that being able to optimize for data reuse was the most important idea. Later on we proposed Eyeriss v2, because we found that
the development of deep neural nets was becoming more diverse. There are large models and there are small models, and we found that there is an issue with how those models utilize the existing hardware. So the idea of the second version is to improve the utilization of the hardware, specifically in terms of the processing elements.

Here is a list of our contributions to address the things I just described. I'm not going to go through the list, but one thing that's very important to understand is that, as Vivienne said, this is a very fast-moving field with a lot of work going on, and a lot of the time you'll feel many of them are a bit ad hoc. What we're trying to do is provide a systematic way to analyze this problem, so that later on we can take those methodologies to implement new architectures. Because of the time constraint, today I'm only going to focus on some key components of our work for you to take home, but if you are interested to know more, here is a list of our publications throughout the years. As you can see, we have bridged both the architecture and the circuits communities, and you'll notice in the talk today that we actually do the co-optimization and design all the way from the architecture down to the circuit level.

Let me first give you a bit of background. The idea of deep neural nets is to use many layers to hierarchically extract meaningful information from the input data. For example, to recognize the image of a car, we first extract some low-level features like edges, and then gradually put them together to form an object like a car. Nowadays deep neural nets can be very large; they can contain thousands of layers, and there are many different types of layers that people are constantly developing. But there's one type of layer that's really fundamental to all the different deep neural nets, and that's what we usually call the convolutional or fully connected layer. They account for over 90%, even up to 99%, of the overall computation, which is why this is really the target layer for acceleration in hardware.

The type of computation performed in this kind of layer is what we call a high-dimensional convolution. Basically, we start from a 2D convolution, in which we apply the filter on top of the input data, which we call the input feature map, or fmap. This is a standard 2D convolution: we do element-wise multiplications, which generate a lot of data called partial sums that are accumulated together to generate just one output value. Then, to generate another output value, we do sliding-window processing by sliding the filter to a different location in the input data. This sliding only happens in the convolutional layer, and that's the only difference between the conv layer and the fully connected layer: in a fully connected layer the filter size is exactly the same as the input feature map size, so there is no sliding. As you'll see throughout the talk, we use a consistent color coding in our figures: green for filter weights, blue for input feature maps, and red for output feature maps, so you can recognize them better.

In addition to 2D convolutions, there are additional dimensions in the data. For example, we have many input channels in the filter and the input feature map, and what we do is the 2D convolution channel by channel, and then we accumulate the partial sums from all channels together to generate just one output feature map plane.
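To make the shape of this computation concrete, here is a minimal, unoptimized reference of the high-dimensional convolution described above, written as plain Python loops. The dimension names (N, C, M, H/W, R/S, E/F) are conventional labels chosen for this sketch, not necessarily the exact notation used in the thesis.

```python
# Naive reference for a convolutional layer. A fully connected layer is the special
# case where the filter size equals the input feature map size (no sliding window).
# Assumed dimension labels: N batch, C input channels, M output channels (filters),
# H x W input size, R x S filter size, E x F output size.

def conv_layer(ifmap, weights, stride=1):
    N, C, H, W = len(ifmap), len(ifmap[0]), len(ifmap[0][0]), len(ifmap[0][0][0])
    M, _, R, S = len(weights), len(weights[0]), len(weights[0][0]), len(weights[0][0][0])
    E = (H - R) // stride + 1
    F = (W - S) // stride + 1
    ofmap = [[[[0.0] * F for _ in range(E)] for _ in range(M)] for _ in range(N)]
    for n in range(N):                      # batch
        for m in range(M):                  # output channels, one per filter
            for e in range(E):              # output rows (vertical sliding window)
                for f in range(F):          # output cols (horizontal sliding window)
                    acc = 0.0
                    for c in range(C):      # accumulate partial sums across input channels
                        for r in range(R):
                            for s in range(S):
                                acc += weights[m][c][r][s] * ifmap[n][c][e * stride + r][f * stride + s]
                    ofmap[n][m][e][f] = acc
    return ofmap
```

Counting the innermost body, this is N·M·E·F·C·R·S multiply-and-accumulates, each nominally reading a weight, an input activation, and a partial sum and writing one result back, which is exactly where the memory-access problem discussed next comes from.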
But in addition, in the same layer we can have many of those filters. Because each of them generates one output plane, with many of them we generate many of those planes, which we call many output channels. And at the end, sometimes we have a batch of data to process, for example if you have a video stream, so we apply the same set of filters on top of each of those input feature maps, and that generates a batch of output feature maps. That's the type of processing we're dealing with here.

One thing you'll notice is that there are many different dimensions in the data, and unfortunately they can change from layer to layer, even within the same deep neural net. We have a lot of different configurations in terms of the number of channels and the number of filters, which means we cannot build the hardware to process just one specific configuration; the hardware needs to be able to adapt to different configurations. The second thing you'll notice is that for each layer there's a lot of processing involved; you usually get from several million to several hundred million multiply-and-accumulates. Even though we can use high parallelism, like what is done in GPUs, to speed up the processing, the actual bottleneck is how we access data. For each multiply-and-accumulate you have to read three pieces of data from memory and then write the result back to memory. In the worst case, if they all go through DRAM, it will be intolerable. As an example, if we want to support the several hundred million MACs in AlexNet, we would have to do almost 3,000 million DRAM accesses, and that would greatly hurt both performance and energy efficiency.

Luckily, what we can do is use some local memory to mitigate this problem. Those local memories can be smaller, but they are faster and more energy efficient to access. The thing we want to leverage with those local memories is data reuse: we want to read data from DRAM once into the local memory, and then use it many more times from the local memory to do the processing in the MACs. Very fortunately, there are many data reuse opportunities in a DNN. For example, we have convolutional data reuse, in which each filter weight is applied to different locations in the feature map and reused in many different MACs, and vice versa. We also have feature map reuse, in which the same activation is used across many different filters. And similarly, we have filter reuse, in which the same weights are applied to many different feature maps. If we can leverage all those data reuse opportunities, in the case of AlexNet we can reduce the 3,000 million accesses to just around 60 million; that's around a 50 times reduction, and it helps the performance and energy efficiency a lot.

But this is only about how we leverage data reuse through the memory hierarchy; there's actually another opportunity here. Remember that we can use parallelism to achieve high performance by supplying a lot of compute datapaths. What we can further do is exploit what we call spatial data reuse with that parallelism: we read the data, for example the weights, just once from the memory, and then we broadcast it to many different compute units, so we also reduce the accesses to the local memory. So now we know there are two properties we'd like: we want to have local memories for data reuse, and we want to have parallelism, not just for higher performance but for data reuse as well.
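As a rough way to see where reuse numbers like these come from, the maximum reuse of each data type follows directly from the loop nest sketched earlier: each weight can serve about N·E·F MACs, each input activation roughly M·R·S MACs (a bit less at the borders), and each output accumulates C·R·S partial products. A small sketch with illustrative, AlexNet-like layer dimensions (the numbers below are made up for illustration and are not quoted from the talk):

```python
# Upper-bound data reuse per datum, derived from the convolution loop nest.
# Border effects and strides other than 1 are ignored; this is a rough estimate.
def reuse_report(N, M, C, H, W, R, S, stride=1):
    E = (H - R) // stride + 1
    F = (W - S) // stride + 1
    macs = N * M * E * F * C * R * S
    weight_reuse = N * E * F          # each weight touches every output position in the batch
    ifmap_reuse  = M * R * S          # each activation is hit by every filter at R*S offsets
    psum_accums  = C * R * S          # accumulations folded into each output value
    return macs, weight_reuse, ifmap_reuse, psum_accums

macs, w_r, i_r, p_a = reuse_report(N=16, M=384, C=256, H=15, W=15, R=3, S=3)
print(f"MACs: {macs:,}  reuse/weight: {w_r}  reuse/input: {i_r}  accums/output: {p_a}")
```

The larger these per-datum reuse counts are, the more DRAM accesses can in principle be replaced by cheap local-memory accesses, which is the gap between the 3,000 million and the roughly 60 million accesses mentioned above.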
That is why the spatial architecture is a very popular architecture used in many of the specialized hardware designs you see nowadays. The spatial architecture, first of all, is highly parallel; you can get from hundreds to thousands of ALUs for processing. It also supplies some local memories, for example a register file for each ALU, and together with some control logic we call this a processing element. We also have a global buffer that can supply data to the array of processing elements, and we can further customize the on-chip network that delivers data between the global buffer and the PEs, so we can get high reuse from that as well.

Suppose today we want to run a MAC operation on a specific ALU. We have a lot of choices in the architecture for where to fetch the data from: the local register file, a neighbor PE, the global buffer, or even the off-chip DRAM. What we've noticed is that accessing data from different places incurs very different energy costs; for example, accessing data from DRAM consumes orders of magnitude more energy than accessing data from the register file. So what we'd like to do is maximize data reuse in the register file and minimize the data accesses to DRAM. What this involves is constructing something called a dataflow. A dataflow is basically a rule saying how you'd like to order the operations, both in the time domain and across the different PEs. If you can find a clever way to do that, you can exploit data reuse in those low-cost memories and also with the parallelism you have. This is a very hard problem, because the neural nets are very large nowadays and we usually have a limited amount of local storage, like register files, and a limited number of PEs. So how do we achieve a high amount of data reuse and at the same time adapt to different layer configurations? That becomes a really important question to answer.

There is actually a lot of work in the field trying to answer this question, and we tried to gain some insight into how they do it. Throughout our research we found that, despite their different design trade-offs and implementations, we can categorize most of the work into just three major dataflow categories. The first is what we call the weight stationary dataflow. The idea is to read each weight just once from DRAM, put it into the register file, and then access it for all future computation from the register file. By doing so you minimize the weight accesses, because you always access the weights from the register file. A lot of work uses this type of dataflow; most notably, the Google TPU can probably also be placed in this category. The second dataflow is called output stationary. Similar to weight stationary, output stationary tries to minimize the energy consumption of accessing partial sums: all reads and writes of the partial sums go through the local register file instead of being pushed back out to the global buffer or DRAM. As we can see, there are also many different implementations that use the output stationary dataflow, and they're actually very popular in some recent publications.
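One way to see the difference between these first two dataflow categories is which operand gets pinned in the PE's register file while the loops run. Here is a minimal single-PE caricature in 1D (my own simplification under stated assumptions; real designs parallelize these loops across many PEs and deal with the full high-dimensional shape):

```python
# Weight stationary: pin one weight locally and stream inputs/partial sums past it.
def weight_stationary(weights, inputs):
    out = [0.0] * (len(inputs) - len(weights) + 1)
    for r, w in enumerate(weights):          # w stays put: fetched once per weight
        for e in range(len(out)):            # partial sums stream in and out each step
            out[e] += w * inputs[e + r]
    return out

# Output stationary: pin one partial sum locally and stream weights/inputs past it.
def output_stationary(weights, inputs):
    out = []
    for e in range(len(inputs) - len(weights) + 1):
        acc = 0.0                            # acc stays put: all accumulation is local
        for r, w in enumerate(weights):
            acc += w * inputs[e + r]
        out.append(acc)
    return out
```

Both functions compute the same 1D convolution; what changes is only which operand avoids repeated trips to the larger memories, which is exactly the axis along which the taxonomy classifies existing designs.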
The last dataflow is what we call the no local reuse dataflow. It's different from the previous two in that there is no local register file in the PE. What you do is read the data from the global buffer, do the computation in the PE, and then push the result back to another PE or back to the global buffer. What it does is trade the area of a lot of small local register files for a much larger global buffer, and that's good for fetching a lot of data at a time from DRAM onto the chip, so you minimize the accesses to DRAM. As we can see, there are also a number of different works that use the no local reuse dataflow.

After examining those three existing dataflows, we found that it's possible to build a new one, which we call the row stationary dataflow, that can further optimize for energy efficiency. The idea is that we want to maximize the data reuse through the register file, but we don't want to do that for just one specific data type; we want to optimize for the overall system energy efficiency. Here's how we do it. We'll start from a simple 1D convolution and build it up to the full high-dimensional one. Say we have this 2D case: we first fetch just one row of filter weights and one row of the input feature map to do a 1D convolution, which generates the partial sums for just the first row of outputs, and we do that entire row inside the same PE through the register file. First we fetch the filter row into the register file, and we fetch the first window of the feature map into the register file as well. Then we run through three cycles of multiply-and-accumulate, always accumulating on top of the same output in the register file. Then we move on to the next sliding window. What we see is that we can reuse the same filter weights in the register file for the next round of processing, and we also reuse part of the input feature map, only replacing one value with a new one, so you get some reuse from that as well. We keep doing this until we're done with the processing. What this does is maximize the amount of reuse for all three data types in this 1D convolution at the register file level.

So now we know how to do a 1D convolution, and we build it up into a 2D convolution. First we take one PE to run the first row's 1D convolution. But in order to complete the processing of the first output row we need the other rows as well, so we take a few more PEs, say two more, to run the other row combinations. Each of them generates partial sums that need to be accumulated together. What we do is simply pass them through the PE array to do the accumulation, without pushing them back to the global buffer, so we can accumulate directly in the PE array and complete the processing of the first output row. In order to complete the other output rows, we simply take more PEs: they reuse the same weights, but with a shifted version of the input, for example rows 2 to 4 instead of rows 1 to 3, and that completes the second output row. We keep doing this until we're done with the entire 2D convolution. And remember that we can leverage spatial data reuse with parallelism: you can see that we actually reuse the same row of weights horizontally across the different PEs in a row, so we get spatial data reuse there. The same thing happens for the input feature map; its rows are reused diagonally across the PEs, so we also get spatial reuse there.
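Here is a minimal sketch of the row stationary primitive as described in this part of the talk: each PE keeps one filter row and a sliding window of one input row in its register file and produces one row of partial sums, and the partial-sum rows of the PEs in a column are then accumulated together. The scheduling details below are simplified assumptions for illustration, not the exact implementation.

```python
# One PE: 1-D convolution of a filter row with an input row. All operands are meant
# to live in the local register file, always accumulating into the same psum row.
def pe_row_conv(filter_row, ifmap_row):
    out_width = len(ifmap_row) - len(filter_row) + 1
    psum_row = [0.0] * out_width
    for f in range(out_width):               # slide the window; the filter row is reused every step
        for s, w in enumerate(filter_row):
            psum_row[f] += w * ifmap_row[f + s]
    return psum_row

# 2-D convolution built from the 1-D primitive: the PE handling filter row r and
# output row e reads ifmap row (e + r), and partial-sum rows are accumulated along
# the PE column, mirroring the vertical accumulation inside the PE array.
def rs_conv2d(filt, ifmap):
    R, H = len(filt), len(ifmap)
    E = H - R + 1
    ofmap = []
    for e in range(E):                        # one output row per PE column
        col_acc = None
        for r in range(R):                    # one filter row per PE row
            psum_row = pe_row_conv(filt[r], ifmap[e + r])
            col_acc = psum_row if col_acc is None else [a + b for a, b in zip(col_acc, psum_row)]
        ofmap.append(col_acc)
    return ofmap
```

In this arrangement the same filter row serves every output row (horizontal weight reuse), the same input row is shared by the PEs on a diagonal, and all partial-sum accumulation stays inside the array, which is the point made in the talk.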
More importantly, the accumulation of partial sums is also done directly inside the PE array without involving the global buffer, so that maximizes the amount of accumulation we can do inside the PE array. That's how we complete a 2D convolution, but there are many more dimensions we have to handle, and that's where the flexibility kicks in. Say we have many more feature maps to process: we can simply run many feature maps through the same set of PEs with the filters already there, so we get more reuse of the filters. The same thing happens when we have more filters or more channels. Basically, we can figure out how many different feature maps, filters, or channels we want to put into this PE set for processing so that we optimize for data reuse, and this becomes an optimization problem. That's why we built an actual compiler to do this optimization: it takes in the specific layer shape of the DNN, it takes in how much memory and how many PEs you have, and it figures out the best way to do the mapping to optimize the energy efficiency. Overall you get a mapping that you can run and that is energy efficient.

With that, we can compare our row stationary dataflow against the other existing ones. We do this in a simulation assuming the same total area across all dataflows; they also have the same number of PEs, and we use the AlexNet layer configurations with a batch size of 16. We compare the different dataflows on the x-axis, and on the y-axis we have the energy normalized to each MAC operation. The first thing you'll notice is that all the dataflows have the same ALU energy consumption, because they all run through the same number of operations; what really determines the final energy consumption is how they access data from the memory. As we can see, row stationary is the one that has the most data accesses to the register file and therefore the most register file energy, but that helps us minimize the overall energy consumption. One interesting example to point out is the no local reuse dataflow: it minimizes the amount of DRAM accesses, but in the end it puts a lot of accesses on the global buffer, and that does not optimize the overall energy efficiency. That is why our row stationary dataflow gets 1.4 to 2.5 times lower energy than the other dataflows.

As for the question of what those access costs are: the costs we showed in the earlier slide were basically profiled from a real 65-nanometer technology. We assume a register file access accounts for about 1 unit of energy; for the NoC it's around 2, for the global buffer around 6, and for DRAM around 200. So there is a hierarchy of different energy costs, and what we do is, with these costs, figure out how to optimize the data accesses to each level.

Here is the same data, presented not as the breakdown across different hardware components but broken down by data type. The weight stationary dataflow clearly has the lowest energy consumption for accessing weights. The different variations of the output stationary dataflow have the lowest energy consumption for accessing outputs. But again, the row stationary dataflow is the one that optimizes for the overall energy efficiency, and that's why it gets the best result.
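The normalized access costs quoted here (register file ≈ 1, NoC ≈ 2, global buffer ≈ 6, DRAM ≈ 200 per access) are enough to build the kind of first-order energy model used for this dataflow comparison. A minimal sketch follows; the ALU cost of one unit per MAC and the access counts in the example are assumptions made up for illustration, not numbers from the talk.

```python
# First-order energy model: total energy = MACs * ALU cost per MAC (assumed 1 unit,
# identical across dataflows) + sum over memory levels of accesses * cost per access.
NORMALIZED_COST = {"RF": 1.0, "NoC": 2.0, "GLB": 6.0, "DRAM": 200.0}

def dataflow_energy(macs, accesses, alu_cost_per_mac=1.0):
    """accesses: dict mapping memory level -> total number of data accesses at that level."""
    data_energy = sum(NORMALIZED_COST[level] * count for level, count in accesses.items())
    return macs * alu_cost_per_mac + data_energy

# Illustrative (made-up) access counts for one layer under two hypothetical dataflows:
# one that keeps most traffic in the register file, one that leans on the global buffer.
macs = 100_000_000
rf_heavy  = dataflow_energy(macs, {"RF": 6 * macs, "NoC": 0.5 * macs, "GLB": 0.2 * macs, "DRAM": 0.01 * macs})
glb_heavy = dataflow_energy(macs, {"RF": 0.0,      "NoC": 0.5 * macs, "GLB": 3.0 * macs, "DRAM": 0.005 * macs})
print(f"RF-heavy dataflow: {rf_heavy:.3g}  GLB-heavy dataflow: {glb_heavy:.3g}")
```

The point the model makes is the one in the talk: a dataflow can dominate one row of this table (for example, the fewest DRAM accesses) and still lose on the total, because what matters is the sum over all levels weighted by their costs.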
With the row stationary dataflow, we built an actual architecture, called Eyeriss v1, to support it. First of all, it uses the row stationary dataflow to optimize for data reuse, as we just saw. It has a few more features: we use a very flexible mapping to improve the PE utilization, and we also leverage the sparsity in the data to reduce the processing power and further reduce the amount of DRAM traffic.

Here's the top-level block diagram of the Eyeriss v1 architecture. On the right we have a 12-by-14 array of processing elements, and they talk to the global buffer in the middle, which is around 100 kilobytes, through an on-chip network. The global buffer talks to the off-chip DRAM through a data bus, and you'll notice there are some components doing compression and decompression, which we'll talk about in a minute.

The first question is that the row stationary dataflow maps the data in a very specific pattern: we want the number of PE rows to equal the filter height, and the number of PE columns to equal the output feature map height. But we have a fixed 12-by-14 array, so how can we map different shapes onto the fixed PE array? We developed some very flexible mapping strategies to deal with this. The first strategy is called replication. To give you an example, in AlexNet layers 3 to 5 the filter size is 3 by 3 and the output feature map size is 13 by 13, which results in a 3-by-13 PE set that we'd like to use. That is much smaller than the physical PE array, which is 12 by 14, so we simply replicate the structure four times to fill the PE array, and we can run different feature maps or filters in the different replications, so we get higher performance. The second technique is called folding. One example is the second layer of AlexNet, which requires a 5-by-27 PE set; that's much wider than the 12 by 14 we have, so we cut it in half and fold it onto the 12-by-14 PE array, so we can still run the processing without sacrificing performance.

So now we know how to map the operations onto our architecture; the next natural question is how we deliver data to them, specifically because our dataflow has patterns like delivering filters horizontally and delivering the feature maps diagonally. The simple solution is to broadcast the data to all the PEs and let each PE decide whether to take the data or not, but from our experiments that's very energy consuming. What we developed instead is a multicast network: if you know you're only going to send data to the PEs in one row, you can shut down the data buses in the other rows, and similar things happen for the filter weights and feature maps. By doing so we can still provide any data delivery pattern, but at the same time we save over 80 percent of the energy compared to a simple broadcast.

Something else we noticed in the data is that there are a lot of natural zeros in the feature maps, so we also want to exploit that property. Since multiplying and accumulating a zero still results in zero, you don't actually need to do the processing. So when we encounter a zero in the PE, we simply gate the PE for that cycle, so there is no switching power, and we save over 45% of the PE power with this type of simple logic. The next thing we can do is leverage those zeros for compression.
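Here is a minimal sketch of the replication and folding decisions described above, assuming a logical PE set of (filter height × output height) mapped onto the fixed 12-by-14 physical array. The real mapper is more involved, so treat this only as an illustration of the idea.

```python
# Logical PE set for row stationary: filter_rows x ofmap_rows.
# Replication: if the logical set is small, tile multiple copies and run different
#              filters / channels / feature maps in each copy.
# Folding:     if the logical set is too wide, cut it into strips and stack them.
PHYS_ROWS, PHYS_COLS = 12, 14

def map_pe_set(filter_rows, ofmap_rows):
    assert filter_rows <= PHYS_ROWS, "filter taller than the physical array"
    if ofmap_rows <= PHYS_COLS:
        reps_per_col = PHYS_COLS // ofmap_rows          # replicate horizontally
        reps_per_row = PHYS_ROWS // filter_rows         # replicate vertically
        return {"strategy": "replication",
                "copies": reps_per_col * reps_per_row,
                "used_pes": reps_per_col * reps_per_row * filter_rows * ofmap_rows}
    strips = -(-ofmap_rows // PHYS_COLS)                # ceiling division: number of folds
    assert strips * filter_rows <= PHYS_ROWS, "folded set does not fit vertically"
    return {"strategy": "folding", "strips": strips,
            "used_pes": filter_rows * ofmap_rows}

# AlexNet examples from the talk: layers 3-5 use a 3 x 13 set (replicated four times),
# layer 2 uses a 5 x 27 set (folded into two strips on the 12 x 14 array).
print(map_pe_set(3, 13))
print(map_pe_set(5, 27))
```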
We use a very lightweight compression scheme to make the data traffic to DRAM much smaller; as we can see, we get 1.2 to 1.9 times less traffic going to DRAM.

With all those features, we taped out the Eyeriss v1 architecture as a real chip. It's fabricated in a 65-nanometer process, and from our measurements we can achieve around 35 frames per second for the convolutional layers of AlexNet at under 300 milliwatts. More importantly, to support that processing, which involves around 16 gigabytes of raw input and 4.5 gigabytes of raw output, we only need around 200 megabytes of global buffer accesses and only around 15 megabytes of DRAM accesses. Here is a quick comparison between the Eyeriss v1 chip and a leading GPU at the time of this measurement, which was around late 2015. The two architectures use a similar amount of resources, while the GPU uses a much more advanced process, but we can still achieve around 10 times higher energy efficiency and much lower DRAM access. [Audience: Is that a whole GPU?] Yes, this is a small, mobile-class GPU.

We further took our chip and integrated it into a real hardware system to do an actual real-world application of image classification. We fetch an image from anywhere, in this case a dog image, and send it to our system for processing. What you'll see is that the image is sent to the Eyeriss chip, the deep convolutional net processing runs on the chip, the result is sent back, and it's actually classified as a dog. What this shows is the capability of the chip to be flexible enough to support real-world applications.

To summarize our work on Eyeriss v1: first, we proposed a taxonomy of dataflows with three major categories to organize existing work, which provides insight into understanding the different designs. With that, we proposed a new dataflow called row stationary; the idea is to optimize for the overall system energy efficiency, which is why we get around 1.4 to 2.5 times higher energy efficiency than other existing dataflows. And with that, we proposed the Eyeriss v1 architecture. It uses the row stationary dataflow to be very flexible and energy efficient in terms of optimizing data reuse; we leverage sparsity in the data to reduce PE power and DRAM traffic, so it's more energy efficient; we use a very flexible mapping scheme to improve the utilization of the processing elements, so it's also flexible and improves performance; and finally we implemented a network that can support any delivery pattern, so it's flexible but at the same time not as energy consuming as a broadcast design. Overall that translates to over 10 times higher energy efficiency than a mobile GPU, and we further demonstrated a real image classification system with the chip to demonstrate its flexibility.
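Returning to the lightweight compression mentioned above: the talk does not spell out the encoding, so here is a minimal sketch of one scheme in that spirit, a simple zero run-length code for activation streams. The actual Eyeriss format may differ in its fields and widths; this only illustrates why long runs of zeros become nearly free to move.

```python
# Lightweight zero run-length coding: store each nonzero value together with the
# number of zeros that preceded it, so zero runs cost almost nothing to transfer.
def rlc_encode(values):
    encoded, zero_run = [], 0
    for v in values:
        if v == 0:
            zero_run += 1
        else:
            encoded.append((zero_run, v))   # (zeros skipped, nonzero value)
            zero_run = 0
    encoded.append((zero_run, None))        # trailing zeros, no value
    return encoded

def rlc_decode(encoded):
    values = []
    for zero_run, v in encoded:
        values.extend([0] * zero_run)
        if v is not None:
            values.append(v)
    return values

stream = [0, 0, 5, 0, 0, 0, 7, 1, 0, 0]
assert rlc_decode(rlc_encode(stream)) == stream
```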
[Audience question: can we apply the row stationary dataflow to a GPU architecture?] The idea with row stationary is that you need hardware that can actually leverage the potential of the row stationary dataflow, and the GPU architecture is not necessarily designed for exploiting that reuse. For example, we need to have a specialized network, and we need to have local storage that's low-cost enough to actually leverage those potentials. [Follow-up: but the central idea is that you're making better use of the data in the local storage, and even that idea is not easily translatable to GPUs?] Right, so I think it's really important to have a co-design between the dataflow and the hardware to actually leverage the full potential.

After we were done with Eyeriss v1, we noticed that the field was moving at a very fast pace, so we did a serious survey of what was happening in the field and put that knowledge into a Proceedings of the IEEE paper on the efficient processing of DNNs. Through this process we recognized some limitations of many of the existing design approaches. Specifically, deep neural nets nowadays are becoming more and more compact: through ideas like filter decomposition, we break down the large filters into many smaller ones, or we use bottleneck layers to reduce the complexity, and the network becomes more compact. For example, one net called MobileNet, proposed last year, actually has higher accuracy than AlexNet with a much lower number of weights and amount of computation. But the question is: does that translate to higher performance in the hardware? One thing we keep saying today is that we want data reuse to get better performance, so we examined the data reuse in those newer nets, and unfortunately it's actually going against our favor. What we show here is three different nets across the years of development; in this direction we get more and more of those compact features. On the y-axis we have the amount of data reuse, basically how many MAC operations each piece of data can support, shown here for the input feature map. Each data point shows the data reuse for one specific layer in that net, and the marked point is the median value of the data reuse in that net. Across the three nets, you'll see that the average amount of data reuse is going down, and furthermore the variation in reuse is going up in the newer nets. That means we have to deal with not just a low amount of reuse but a wide range of data reuse in the architecture.

So what does that mean for hardware performance? Here is one simple example to illustrate it. Suppose our hardware is built to exploit reuse of the weights: we have four PEs, and we send the same weight to the four PEs to process with different feature maps. Suppose today we don't have enough reuse, say each weight can only be reused with two different inputs instead of four. In the first case, if the dataflow is not flexible enough to schedule other operations on those two PEs, you have idle PEs, and that hurts your performance. In the second case, even if you have a flexible dataflow that schedules other operations, say not w1 but w2 on the other two PEs, we only have a broadcast network here, which means we have to send w1 in one cycle and w2 in another cycle. We don't have enough bandwidth to provide the data, and that also hurts the performance. So overall, if we don't have enough reuse, the performance of some existing designs will go down. Our idea for mitigating this is a very flexible mapping strategy, which involves two things: first, we need a dataflow that's flexible enough to avoid idle PEs, and at the same time we need to improve the design of the on-chip network that delivers data, so we can exploit data reuse but still provide enough bandwidth.
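The four-PE example above can be turned into a tiny utilization model: with a broadcast-only network, a weight that can only be reused across two inputs either leaves two PEs idle or forces the two distinct weights to be sent in separate cycles. A minimal sketch under those assumptions (the function and its parameters are hypothetical, for illustration only):

```python
# Toy model of the 4-PE example: each step we must deliver enough distinct weights to
# keep every PE busy, but the bus can only deliver a limited number of values per cycle.
def effective_utilization(num_pes, reuse_per_weight, flexible_mapping, bus_values_per_cycle=1):
    if not flexible_mapping:
        # rigid dataflow: only `reuse_per_weight` PEs receive work, the rest sit idle
        return min(num_pes, reuse_per_weight) / num_pes
    # flexible dataflow: all PEs get work, but distinct weights must share the bus
    distinct_weights = -(-num_pes // reuse_per_weight)            # ceiling division
    delivery_cycles = -(-distinct_weights // bus_values_per_cycle)
    return 1.0 / delivery_cycles                                   # compute stalls waiting for data

print(effective_utilization(4, 4, flexible_mapping=False))        # enough reuse: 1.0
print(effective_utilization(4, 2, flexible_mapping=False))        # idle PEs: 0.5
print(effective_utilization(4, 2, flexible_mapping=True))         # bandwidth-bound: 0.5
print(effective_utilization(4, 2, True, bus_values_per_cycle=2))  # wider delivery: 1.0
```

The last line is the point of the argument: fixing only the dataflow or only the network bandwidth still leaves utilization at one half; both have to improve together.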
Specifically, there are three different cases we'd like to support. When we can get a lot of reuse for the weights, we can send the same weight to different PEs to get data reuse with different inputs. When we don't have that much reuse, we can bring in more weights to fill the other PEs. And in the extreme case where we just don't have reuse for the weights, we want to be able to provide enough bandwidth to supply the processing and keep up the performance. What you'll notice is that there are four very specific delivery patterns for the data we want to support: unicast for delivering at high bandwidth, broadcast to support high data reuse, and something in the middle, so we have grouped multicast, where we multicast in different groups, and interleaved multicast as well. They all come in pairs: if you have unicast for one data type, then you have broadcast for the other; if you have multicast in an interleaved fashion, then you need grouped multicast for the other as well.

The question is: do we need all of them in our architecture? Let's take a look at how we deliver the data, specifically the input feature map, in the row stationary dataflow. First, here is the case of AlexNet, in two charts: one shows how we deliver data horizontally in the PE array, and the other how we deliver data vertically. Each pie chart shows how much data is delivered in each specific fashion, either unicast, multicast, or broadcast. As you can see, all three types of data delivery are heavily exercised across the different layers of the same neural net. Furthermore, we also have the results for MobileNet: because the amount of data reuse is lower in MobileNet, we have to do more unicast in this case, but overall all three types of data delivery are still very important for optimizing performance and energy efficiency, so we need to support all of them.

So one goal of our Eyeriss v2 design is to support a wide range of data reuse. If we have a high amount of data reuse, we can leverage the ideas we proposed before to exploit it, but when we don't have enough data reuse, we want a dataflow that's flexible enough to minimize the number of idle PEs, and we also need a network-on-chip that can provide high enough bandwidth; specifically, it needs to cover all the different delivery patterns.

The second thing we noticed about the networks nowadays is that they can be pruned a lot. In one case developed in our lab, we pruned AlexNet so we can still get similar accuracy but with a much lower number of weights and much less computation compared to the dense one, and we want to leverage that for better performance and energy efficiency. So the second goal in the second version of the Eyeriss design is to support a wide range of data sparsity; specifically, we want to leverage the sparse feature maps and weights and translate that into higher performance and energy efficiency. But due to the time constraint, today I'm going to focus only on how to support a wide range of data reuse.

First of all, we want to improve our dataflow. As you remember, we have a dataflow that's very good at exploiting data reuse; the thing we want to further improve is how we
minimize the number of idle PEs. There are two main ideas we propose here. The first is that we want to be able to map data from different dimensions, actually from all the different dimensions, onto the PEs. For example, if we have many input channels, we can map data from different input channels onto one specific dimension of the PE array; but when we don't have many input channels, we don't want to waste that part of the PE array doing nothing, so we can replace it with another data dimension, for example different columns or different feature maps, to further fill the PE array. The second idea is that we want to be able to map not just one but multiple dimensions of the data onto the PE array at the same time. For example, we can break the PE array down into blocks; within each block we map data from one set of dimensions, and then at the block level we mix and match another set of dimensions. By doing so you have a lot of flexibility to choose which data from which dimensions you want at the same time, and overall you can maximize the utilization of the PEs. This new dataflow, which we call row stationary plus, minimizes the number of idle PEs so you can get higher performance, and at the same time it is actually a superset of the row stationary dataflow, meaning that any mapping row stationary can do, row stationary plus can do as well, so you always get the same or better performance out of it.

So now the dataflow is not the issue anymore, and the next bottleneck becomes the on-chip network design. Here I list the three common designs among existing work. First, we have the broadcast network; even though we said the Eyeriss v1 network is a multicast one, in terms of source bandwidth it's still the same as a broadcast network. With this network you get high reuse, but you don't get high bandwidth from the source. The alternative is a unicast network, in which each destination gets its own data source; you get much higher source bandwidth, but you really don't get much reuse. To get the best of both worlds, we could have an all-to-all connection, which provides both high reuse and high bandwidth at the same time, but as you can imagine, its cost grows quadratically with the size, which is really hard to scale in a real implementation.

Something we remember from our architecture class is that a mesh network can be a good solution to this. In between the sources and the destinations, we use routers connected in a linear fashion. It can easily support the high-bandwidth mode, and it can also support the high-reuse mode by using a router to send data everywhere. In some more complicated cases, like the grouped multicast, we can support it by separating the network into disjoint sections. However, for the interleaved multicast mode we have a problem, because in that case there is a bandwidth-limiting route: you either have to suffer lower performance, or you need to implement some flow-control logic that makes the cost higher, and we don't like either of those. That's why we propose a new design called the hierarchical mesh network. The idea is that we want to support a wide range of data delivery patterns, but at the same time keep the cost low so we can easily scale the size. Here is the idea of the hierarchical mesh network: in this case we group two sources into a cluster, two routers into a cluster, and two destinations into a cluster. Between the clusters we use a simple mesh network to connect them, and within the same cluster we use an all-to-all network to provide enough connectivity; but since the cluster size is usually much smaller than the full size of the array, the complexity can be well contained.
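Here is a minimal structural sketch of the hierarchical mesh idea as described above: sources, routers, and destinations are grouped into small clusters, clusters are chained by a mesh, and within each cluster an all-to-all connection provides the flexibility; since routes are fixed at configuration time, the "routers" reduce to configured switches. The cluster size of 2 mirrors the example in the talk; everything else below is my own simplification.

```python
# Configuration-time routing over a 1-D hierarchical mesh: each cluster holds
# `cluster_size` sources and destinations with all-to-all connectivity inside,
# and neighboring clusters are linked in a chain. The routing table below is the
# kind of static mapping a circuit-switched router could be configured with.
def configure_routes(num_clusters, cluster_size, mode):
    n = num_clusters * cluster_size
    routes = {}                                   # destination index -> source index
    for d in range(n):
        if mode == "unicast":                     # high bandwidth: each dest gets its own source
            routes[d] = d
        elif mode == "broadcast":                 # high reuse: one source feeds everyone
            routes[d] = 0
        elif mode == "grouped_multicast":         # one source per cluster feeds its own cluster
            routes[d] = (d // cluster_size) * cluster_size
        elif mode == "interleaved_multicast":     # sources alternate across cluster members
            routes[d] = d % cluster_size
    return routes

for mode in ("unicast", "broadcast", "grouped_multicast", "interleaved_multicast"):
    print(mode, configure_routes(num_clusters=2, cluster_size=2, mode=mode))
```

The sketch only captures who feeds whom in each mode, not the link costs; the cost argument in the talk is that the expensive all-to-all part stays inside fixed-size clusters while the number of clusters scales linearly.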
So now we can support the high-bandwidth mode fairly easily, and we can also support the high-reuse mode by sending data from one source to all destinations. For grouped multicast, we can use different clusters for different groups, and for the interleaved multicast mode we can leverage the higher bandwidth between the clusters to deliver the data as well. This is a case where we interleave by a factor of two; if you want to interleave more, you can simply scale up the cluster size. Another thing to point out is that all the routing can be done at configuration time, so you don't need to figure out where the data should go at run time. That means the implementation of the router can be a simple circuit-switched design containing only muxes, which is really cheap to implement. In terms of scalability, first you determine the size of each cluster based on how much you want to interleave the data and what cost you can tolerate in the all-to-all connection, and then you simply scale the number of clusters linearly, so the cost also scales linearly; compared to an all-to-all design, whose cost scales quadratically, this grows at a much gentler rate.

With this idea we propose the Eyeriss v2 architecture. This is a more distributed architecture: for the sources we have the global buffer clusters, for the destinations we have the PE clusters, and they are connected through the router clusters in a mesh fashion. Some typical numbers we use in this architecture: each global buffer cluster is around 10 to 11 kilobytes, and each PE cluster is normally around a 4-by-4 PE array. With this type of configuration, the network only consumes around 3 to 5 percent of the total energy, which is a very low cost, and you can imagine that when you scale up the architecture the ratio stays roughly the same. Compared to an all-to-all network, whose cost scales quadratically, this is clearly a more economical and efficient design.

With this new network architecture, we've done some simulations to see the actual benefit in terms of throughput, and we compared it to our old design, Eyeriss v1. For this simulation, both designs have 256 PEs: in v1 that's simply a 16-by-16 PE array, and in v2 we have 16 clusters, each with 16 PEs, and we use a batch size of one to exercise a lot of those corner cases. First, here is the throughput speedup from Eyeriss v1 to Eyeriss v2 on AlexNet. Remember that AlexNet is a very large net, so it has a lot of data reuse in the convolutional layers, and we don't get that much improvement there, around 1.5 to 2 times improvement in throughput for the convolutional layers. But the benefit really kicks in when we run the fully connected layers, because at batch size 1 the weights have absolutely no reuse at all, which means we need really high bandwidth to supply the weights, and v1 simply cannot afford that; that's why we get a much higher improvement in throughput there. The other net we ran is MobileNet, which is a much more compact design. As you can see, even in the convolutional layers, because it's a more compact design, we consistently get a much higher improvement in
throughput, around 3 to 5 times at least. In some layers we get an even higher speedup, because those are the depthwise convolutional layers in MobileNet: they have very few input and output channels to reuse across, which means you need really high bandwidth to deliver the input feature maps, and v1 cannot keep up with that; that's where v2 gets a much higher throughput improvement. Overall, averaging the improvement across the different networks, at 256 PEs we consistently get around 10 times or more speedup over v1, and when we scale up the number of PEs, the speedup increases even more, because the bandwidth of v1 simply doesn't scale and v2 is a much more scalable design. So overall we get at least a 10 times speedup over v1.

To summarize what we've done in v2: first, we want to support a wide range of data reuse, and we've done that through a new dataflow called row stationary plus, to minimize the number of idle PEs, and a new network called the hierarchical mesh network, which provides high bandwidth while also exploiting data reuse. Overall that translates to over 10 times speedup over v1. The second thing, which we didn't have time to talk about today, is how to support a wide range of data sparsity. The basic idea is that we can process both feature maps and weights in the compressed format directly, which improves throughput and energy efficiency, but we can also adapt when there's no sparsity and process the data directly in the raw format, so we don't incur the compression overhead. From that we see an additional 3 times speedup from sparsity as well. That summarizes Eyeriss v2.

To conclude my talk: we know that the processing of deep neural nets is going to become more and more important, because we're going to use it more and more in our lives. One thing to recognize is that being able to optimize for data reuse is the key to achieving high energy efficiency. In order to achieve high performance, we need high utilization of the processing elements, and at the same time we need a network that can deliver the data. And we need to do all of this in a very flexible manner, because we have to support a wide range of different neural nets, and they are constantly evolving. This can be done through the co-design of what you've seen here today, the dataflow and the hardware, and then we can optimize for the three pillars: performance, energy efficiency, and flexibility. Overall, if we can achieve this balance, it can open up new opportunities for applications in the future.

That concludes my talk, and I want to acknowledge a lot of people who carried me through this project. First of all, my advisors, Vivienne and Joel: being able to participate in the Eyeriss project was a truly amazing journey. From day one they have been so devoted to this project, and that's what pushed me through it, and most importantly, they always had more confidence in me than I ever had, and that's how I could carry through this project and do something I could never have done without them, so I really want to acknowledge that. I also want to thank my committee member, Daniel; I know this is a very busy time for you, and thank you very much for serving on my committee and for your feedback on our paper as well. And I want to thank our collaborator Tushar; he basically came up with the idea of the network in Eyeriss
v1, like one month before our tapeout, so without him we could not have gotten the chip done. Next I want to thank all the members of our EEMS group. I think I'm very lucky to have seen the group getting stronger every year with all these great people, and the other thing that always surprises me is how well we complement each other in terms of our expertise and how different our experiences are, so I really want to thank all of you. The most important person in my life is probably my wife, Sasha; I'm alive and kicking today because of her, and even though I work weird hours and have a lot of weird constraints, I really, really thank her for accommodating all of that. Finally, I want to acknowledge all my friends and family; they have been essential to my life at MIT, making it fuller and more meaningful, so I thank all of them. That concludes my talk, and I'm happy to take questions. Thank you. [Applause]

[Question: there's some theoretical limit, but where are the future opportunities?] So the question is what future opportunities we can leverage from this knowledge. One thing we noticed is that we have a lot of knowledge about how we'd like to see the hardware evolve, but the networks are not necessarily being designed that way. So we want to translate our knowledge about the hardware back into how the network design is done; together I think there is a great opportunity to further improve the performance and efficiency by maybe another order of magnitude, so I think there's a great opportunity there as well.

Basically, we can leverage a similar idea from Eyeriss v1: we want to build a compiler for Eyeriss v2 as well, but now we can incorporate more constraints, not just energy efficiency but also performance, and depending on how you'd like to trade off the two, you can build the compiler for that.

[Question: if this kind of hardware had existed ten years ago, maybe the networks would have turned out more regular and easier to handle; do you see a better kind of network emerging from the hardware you've built?] I think the answer is yes. What I see happening is that most of the hardware used nowadays, like GPUs and CPUs, doesn't necessarily exploit the data reuse we've talked about here, but a lot of people are now working on custom hardware, you're going to see more and more people using it, and that's going to propel the development of the neural nets as well. I think that takes time, but it will happen in the future. We can all see the networks becoming smaller, with fewer weights and less computation, but if we can do that in a way that is not detrimental to how we do data reuse, because that is really fundamental to how we achieve really good energy efficiency, so that we can achieve both at the same time, I think that would be really great.

[Question: what benchmark should we use to effectively evaluate different architectures?] Yeah, that's a good question; it has actually been a long-term question for us throughout the years, and we try to advocate for people to use a more complete set of benchmarks to evaluate different architectures, so you don't just look at one aspect. What we've seen is that
certainly you want to evaluate on a wide range of different neural nets that people actually use a lot, and we've identified certain corner cases, like depthwise layers and fully connected layers, which have very different characteristics. I think it's very important to test across a wide range of configurations in terms of the shapes and sizes of the DNN layers in order to get a full picture of how the hardware performs.

[Question: there are a lot of existing techniques, like the polyhedral model in contemporary compilers, that also try to improve data reuse with different algorithms; how do they relate to this kind of work?] I actually looked into polyhedral compilers at the very beginning of my research. What we found is that polyhedral compilers work for a very specific architecture: you run the compiler when you already know what the architecture looks like. Even then, it's still a large, high-dimensional space to search, and furthermore, if you can co-design the hardware, that opens up nearly unlimited opportunities. What I really wanted was to get some insight into how to do this well. If you know exactly how the hardware will perform, say for a specific CPU architecture, you can build a polyhedral compiler for that specific setting; but since we opened up the opportunity to co-design the hardware, we can't simply apply that, because we don't even know yet what type of architecture we want to run this on. [Follow-up: but once you have the architecture, could you use the ideas from the polyhedral model?] I think our compiler is actually doing something similar to what the polyhedral model does, basically, but for our architecture.

[Question: is there anything special you'd like to mention about how you designed the sparse architecture?] It's actually a co-design, a collaboration between me and a group member in our group. We have a method to prune the network model so that it can run very energy efficiently on the hardware; it's not that you prune the model to target just the fewest number of weights, but we prune it in a way that runs well on top of our hardware. And then the hardware is basically a new design of the processing elements that can take the compressed format of the weights and feature maps directly, so when you do this type of processing you can skip the zero cycles directly and improve the energy efficiency and processing throughput. That's the high-level idea of the design.

[Question: or was it just RTL babysitting, placing things basically where you wanted them?] The question is whether we did any special tricks in the back-end flow to make it better. We didn't really look in that direction; we used a pretty standard digital flow. I think there is a lot of opportunity in terms of further tweaking the circuits, but those are optimizations we didn't really look into at that time.

[Question: are there other types of problems that can leverage this type of
architecture?] I did actively look into other problems to solve with this architecture. I think there are similar cases, like people trying to solve graph problems; there are opportunities there, since there is also sparsity in the data, so you could use very similar ideas. Other than that I don't have much, but I think graph processing is one possible application. [Applause]
Info
Channel: EEMS Group - PI: Vivienne Sze
Views: 6,300
Keywords: Deep Learning, Deep Neural Networks, Eyeriss
Id: brhOo-_7NS4
Length: 69min 8sec (4148 seconds)
Published: Tue Aug 07 2018