Song Han's PhD Defense. June 1, 2017 @Stanford

Captions
[Bill Dally] Song is a really great student, one of the best students I have had the privilege of supervising in the thirty-plus years I have been a faculty member, between MIT and Stanford. He is very broad: his expertise ranges from VLSI circuits to computer architecture to artificial intelligence, and by spanning those areas he can make connections, apply techniques from one area to the others, and do very innovative things. He is also really smart, with a bunch of great insights; he has worked on problems that people have been studying since the early 1990s and has been able to improve on the previous results because of his ability to see things in a different way. He has a tremendous amount of leadership capability: he acts as a mentor to the junior students in my group, and to students in other groups as well, and he is very effective at helping them move along with their research programs. Above all, he is just a really great person: fun to be around, kind, very considerate of others, and a real pleasure to be with.

It was a long journey before we arrived at his PhD topic. The theme of my research group is to improve the energy efficiency of computing, and to do that through specialization. When Song started in the group we tried to find something we could specialize. Because we were part of the Platform Lab, which went by a different name at the time, we looked at John Ousterhout's RAMCloud system, and he spent close to the first year of his PhD trying to figure out how to build specialized hardware for RAMCloud. I actually think it would have worked, but we got bogged down in detail; there was an important lesson about not digging into the leaves before you have figured out what the forest is. We then spent time on some other data center applications we tried to specialize, and finally we stumbled onto machine learning, and that is where we started making a lot of progress. Very little had been done at that point on special-purpose hardware for machine learning, and there was the opportunity to co-design the algorithm with the hardware. Just building hardware to run the existing algorithms wasn't very interesting, people had done that, but by making tweaks to both the data representation and the problem itself we could make big progress, and Song will tell you all about that. He has also passed my test for when a student is ready to graduate: it used to be that I taught him, and now when I meet with Song he teaches me, so he is ready to go off and be a professor himself, I guess at MIT if things turn out.

[Song Han] Thanks, Bill, for the kind introduction; I am grateful for these years with Bill, and I also thank my committee members for making the trip from all over the valley and for your time today. I am going to talk about efficient methods and hardware for deep learning. As we have seen, deep learning has created many new applications in the past five years, for example self-driving cars, machine translation,
AlphaGo, and smart robots. A recent trend for such deep neural networks is that, to get higher accuracy, we make the models deeper and larger: the size of the winning ImageNet model has grown substantially through the 2015 ResNet, and for speech recognition the number of training operations for Baidu's Deep Speech grew by an order of magnitude within one year just to reduce the error rate by about three percent. So you pay a lot of computation to get a three percent decrease in error.

Such large computation and memory requirements create problems for deploying deep neural networks, whether in the data center or on a mobile phone. There are several challenges. The first is that the model size makes distribution difficult: for mobile apps, if your app is over 100 megabytes it cannot be downloaded until you connect to Wi-Fi, and for self-driving cars, say you buy a Tesla and want over-the-air updates of the deep neural network models, a huge model is difficult to ship. The second problem is speed: training such deep neural networks is very slow. For example, training ResNet-152 with four GPUs took about 1.5 weeks, and it improves accuracy by only about 0.05 percent over ResNet-101; that really limits the productivity of machine learning researchers who want to prototype and explore new network architectures. The third challenge is energy efficiency. AlphaGo's match last year ran on roughly 2,000 CPUs and 300 GPUs, with an electric bill of about $3,000 per game. This limits the feasibility of deployment either on the phone, where it drains the battery, or in large-scale data centers, where it increases the total cost of ownership. As the Google TPU paper points out, if every user spoke to Google voice search for just three minutes a day, then without the TPU Google would have had to double its data centers. So deep neural networks are a very important workload.

Let's see where exactly the energy is consumed when running a neural network. A larger model means more memory references, which leads to more energy consumption. This table lists the energy of different operations in a 45-nanometer technology, and a memory reference is two to three orders of magnitude more expensive than arithmetic: with the energy of one DRAM access you could do on the order of a thousand multiply-add operations. So it is those memory references, and the large models that require so many of them, that hurt the efficiency of inference.
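To make that memory-versus-arithmetic gap concrete, here is a small back-of-the-envelope sketch. It is only an illustration: the per-operation energies are approximate 45 nm figures of the kind shown on the slide, and the layer size is hypothetical.

# Rough energy estimate for one fully connected layer, showing why DRAM
# references dominate. Energies are approximate 45 nm figures (order of
# magnitude only); the layer dimensions are made up.
E_MAC_PJ  = 4.6     # ~ one 32-bit float multiply + add
E_SRAM_PJ = 5.0     # ~ one 32-bit on-chip SRAM read
E_DRAM_PJ = 640.0   # ~ one 32-bit off-chip DRAM read

def fc_layer_energy_uj(in_dim, out_dim, weights_on_chip):
    macs = in_dim * out_dim                  # one MAC per weight
    weight_reads = in_dim * out_dim          # every weight is read once
    e_mem = weight_reads * (E_SRAM_PJ if weights_on_chip else E_DRAM_PJ)
    return (macs * E_MAC_PJ + e_mem) / 1e6   # picojoules -> microjoules

print(fc_layer_energy_uj(4096, 4096, weights_on_chip=False))  # DRAM-bound
print(fc_layer_energy_uj(4096, 4096, weights_on_chip=True))   # far less memory energy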
So how do we make it more efficient? I propose to improve the efficiency of deep learning through algorithm and hardware co-design. Previously, when we designed hardware, the benchmark was a black box, for example SPEC 2006, and we optimized the hardware by profiling those fixed benchmarks. What I propose instead is to open up the box, see what we can change on the algorithm side first, and then optimize the hardware, for example an FPGA or an ASIC, for the new algorithm. So we are bridging the boundary between the algorithm and the hardware.

Let's first open the box and go over some deep learning basics. Deep learning is enabled by large datasets and modern hardware: for training we need a large-scale dataset, for example ImageNet, and state-of-the-art GPUs; with them we train a model, for example a convolutional neural network or a long short-term memory network; and once we have the model we feed it input data and run inference, either on embedded hardware such as a phone or in large-scale data centers in the cloud. That is the basic setup. Previously, the approach was to deploy the model for inference right away, exactly as it came out of training, since we spend so much optimization effort coping with a highly non-convex problem to reach a good local minimum. What I propose is to add model compression between training and inference, making the model much smaller, with far fewer parameters and memory references, and then to deploy it on specialized hardware. Compression, however, brings irregularity, so there are many considerations: how to traverse the sparsity, how to handle the level of indirection, all the irregularity introduced by compression. The overall goal is to make inference faster and more energy efficient without sacrificing accuracy, so the figure of merit for this talk centers on four factors: smaller models, faster inference, better energy efficiency, and more accurate models.

Here is the agenda for today. I will start with model compression, making the model smaller, followed by two hardware accelerators that run inference directly on the compressed model, followed by efficient training, making the model more accurate through sparsity-based regularization.

Let me start with model compression. If you think about it, there are roughly two ways to compress a deep neural network: first, you can reduce the number of connections, so if you have ten million parameters you keep fewer of them; second, you can reduce the number of bits per weight. Multiplied together, they give a smaller model. I will first introduce "Learning both Weights and Connections for Efficient Neural Networks", which targets a smaller number of connections. Here is an illustration of pruning a deep neural network: originally we have a dense model in which every connection is trained; pruning the network is like pruning a tree, getting rid of the redundant connections. This was first explored by Yann LeCun around 1990, and 26 years later I revisited the problem from the hardware-efficiency perspective and with modern, much larger models.

This is how we prune a deep neural network. We train the model as usual, and then we prune the connections that are redundant based on a very simple heuristic, the absolute value: if the absolute value of a weight is small, we set it to zero. Anything multiplied by zero is zero, so we neither compute it nor store it. Finally, we retrain the remaining weights to recover the accuracy loss caused by pruning, and we can repeat the process iteratively. More parameters do not necessarily mean higher accuracy; sometimes they just overfit, and we sometimes even observe the accuracy improve during pruning.
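As a rough illustration of the magnitude-based pruning heuristic just described, here is a minimal NumPy sketch; the 80% pruning ratio, the function name, and the retraining loop hinted at in the final comment are assumptions for the example, not the actual experiment code.

import numpy as np

def prune_by_magnitude(weights, sparsity=0.8):
    """Zero out the smallest-magnitude weights; return pruned weights and a mask."""
    threshold = np.quantile(np.abs(weights), sparsity)       # e.g. 80th percentile
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask

w = np.random.randn(256, 256).astype(np.float32)
w_pruned, mask = prune_by_magnitude(w, sparsity=0.8)
print("fraction pruned:", 1.0 - mask.mean())                 # ~0.8

# During retraining, the mask is re-applied after each update so the pruned
# connections stay at zero:  w = (w - lr * grad) * mask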
Let's see how such pruning affects the prediction accuracy. The x-axis is how many of the parameters we prune away, and the y-axis is the loss of accuracy incurred. If you just prune the network naively, the accuracy drops very sharply: at 80% pruned, the loss of accuracy is about 4%, which is intolerable. How can we do better? We can retrain the remaining weights, the remaining connections, and then the accuracy loss at 80% pruning is 0%; it is fully recovered. Not only can we recover the accuracy: if we do this iteratively, pruning and retraining, pruning and retraining, we can push the frontier even further, to only about 10% of the parameters remaining.

That was a convolutional network for image classification. What about other networks, such as recurrent neural networks and long short-term memory? I experimented with NeuralTalk, an image-captioning LSTM. Pruning the baseline without retraining, the score drops very quickly, but with proper retraining the BLEU score is not hurt until about 90% of the parameters are pruned away. So we make similar observations for both convolutional and recurrent neural networks. Here are some visualizations of how pruning the LSTM affects caption quality. For the first image, the caption from the original network is "a basketball player in a white uniform is playing with the ball"; after removing 90% of the parameters it says "a basketball player in a white uniform is playing with a basketball". For the second image, if we prune too aggressively, say removing 95% of the parameters, the caption quality collapses into something like "a man in a red shirt and black and black shirt is running through a field", so the language model breaks down when pruned too far.

Now let's look at how pruning changes the weight distribution. In the original distribution, the small weights around zero are removed, which is why a gap appears in the middle, and after retraining the remaining weights the distribution softens again, which is why the accuracy is fully recovered. We can see that the weights roughly fall into two groups, a positive one and a negative one. The further question is: do we really need a continuous distribution, or can we bin the weights into a small number of buckets? That is the next idea, trained quantization, from "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding". The intuition is that we do not need high-precision weights: for example, 2.09, 2.12, 1.92 and 1.87 can all share the value 2.0.

We do this by clustering the weights with k-means: similar weights are clustered together, the centroids form a codebook, each weight is replaced by the index of its centroid, and then we retrain the codebook. As a result we can quantize the weights of a deep neural network from 32 bits down to only 4 bits, an 8x saving. Here is an illustration with a 4x4 weight matrix: previously every weight is a 32-bit floating-point number; after clustering into, say, four centroids such as 2.0, 1.5, 0.0 and -1.0, all we store is the index into this table, which requires just two bits instead of 32, roughly a 16x saving for the weights. That covers the feed-forward path.
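Here is a minimal sketch of that k-means weight-sharing step, using scikit-learn's KMeans for brevity; the matrix, the 16-cluster (4-bit) setting and the default centroid initialization are illustrative choices, and the real pipeline clusters each layer separately and retrains the codebook afterwards.

import numpy as np
from sklearn.cluster import KMeans

def quantize_weights(weights, n_clusters=16):
    """Cluster weights into n_clusters shared values (16 clusters = 4-bit indices)."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(weights.reshape(-1, 1))
    codebook = km.cluster_centers_.ravel()        # the shared weight values
    index = km.labels_.reshape(weights.shape)     # small integer stored per weight
    return codebook, index

w = np.random.randn(64, 64).astype(np.float32)
codebook, index = quantize_weights(w)
w_shared = codebook[index]                        # reconstructed weight matrix
# Storage: 4 bits per index plus 16 float32 centroids, versus 32 bits per weight.
# (Later in the talk: initializing the centroids linearly over the [min, max]
# range of the weights works better than random initialization.)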
How about backpropagation: how do we train such shared weights? After we compute the gradients, we color them with the same colors as the weights and do a group-by operation: the gradients belonging to the same centroid are grouped together and summed, a reduce operation, then multiplied by the learning rate and subtracted from the centroid of the previous iteration. That is how we run stochastic gradient descent on a weight-shared deep neural network. And let's see how the weight distribution changes: previously we had a continuous distribution; after weight sharing we have a discrete one, sixteen spikes, which can be represented by only 4 bits per weight instead of the 32 bits used before. After retraining the weight-shared network there is a subtle shift of the centroids, which compensates for the accuracy loss.

Now, how many bits do we really need to preserve the accuracy? For convolutional layers, the accuracy is preserved down to about 4 bits and begins to deteriorate below that; fully connected layers can go down to about 2 bits before the accuracy starts to drop.

So far we have seen two methods, pruning and trained quantization. The next question is whether they work well together. This figure shows how much we can compress the model versus how much accuracy we lose. Combining pruning and quantization, we can compress the model to about 3% of its original size without hurting accuracy, whereas with pruning or quantization alone the accuracy already begins to drop at about 10% of the original size; compared with SVD, a factorization-based method, more than one percent accuracy loss is already observed at a much smaller compression ratio.

Finally, there is a third idea that contributes roughly another 30 percent saving in storage: Huffman coding, a lossless compression method applied to the remaining weights. The idea is very simple: use fewer bits for the weight values that appear frequently and more bits for those that appear rarely.

Putting it all together, this is the Deep Compression pipeline: pruning gives fewer weights, trained quantization gives fewer bits per weight, and Huffman coding squeezes out the remaining redundancy. Evaluated on a range of deep neural networks, including LeNet, AlexNet, VGGNet and GoogleNet, the compression ratio ranges from roughly 10x to 49x with the accuracy fully preserved. It is not surprising that the more recent networks, which have only one fully connected layer, show smaller compression ratios, because they are more efficient to begin with, but they can still be compressed by an order of magnitude.
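To illustrate the Huffman step, here is a minimal sketch that builds Huffman code lengths over the quantized weight indices and estimates the average bits per weight; the toy index distribution is made up, and a real encoder would also emit the actual bit patterns.

import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over `symbols`."""
    freq = Counter(symbols)
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**a, **b}.items()}
        heapq.heappush(heap, (fa + fb, tie, merged))
        tie += 1
    return heap[0][2]

# Toy distribution of 2-bit quantization indices: frequent values get short codes.
indices = [0] * 50 + [1] * 30 + [2] * 15 + [3] * 5
lengths = huffman_code_lengths(indices)
avg_bits = sum(lengths[s] * c for s, c in Counter(indices).items()) / len(indices)
print(lengths, avg_bits)   # ~1.7 bits/weight on average versus a fixed 2 bits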
The next question: so far we have taken an existing model and compressed it; can we instead design a compact model to begin with and then compress that? I collaborated with Berkeley and tested the idea on SqueezeNet, a very efficient architecture that is only 4.8 megabytes, compared with AlexNet at roughly 240 megabytes. It uses a squeeze layer to shield the computation-expensive 3x3 convolutions behind a smaller number of channels, squeezing the channel count down and then expanding it again; building the network out of such blocks gives AlexNet-level accuracy while being about 50x smaller. Applying Deep Compression on top of SqueezeNet, I obtained a model of less than half a megabyte, small enough to fit easily in on-chip memory. Altogether that is roughly 500x smaller than AlexNet, which shows that even compact models are still amenable to compression.
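For reference, here is a minimal PyTorch-style sketch of that squeeze/expand building block; the module name and channel counts are illustrative, not the published SqueezeNet configuration.

import torch
import torch.nn as nn

class Fire(nn.Module):
    """Squeeze with 1x1 convolutions, then expand with parallel 1x1 and 3x3 branches."""
    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))            # fewer channels feed the 3x3 branch
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

block = Fire(in_ch=96, squeeze_ch=16, expand_ch=64)   # hypothetical sizes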
A smaller model also has a smaller memory footprint and is more cache friendly. I benchmarked the fully connected layers of AlexNet, VGGNet and NeuralTalk on a CPU, a GPU and a mobile GPU to see how much speedup we get from compression: about 3x on the CPU, about 3x on the GPU, and a similar speedup on the mobile GPU; for energy, roughly 6x better on the CPU, 3x on the GPU, and 4x on the mobile GPU for these fully connected layers.

Deep compression has been adopted in industry. Baidu, for example, applied it to credit-card recognition in its wallet app: you do not want to send an image of your credit card to the cloud, you want to recognize it locally. Facebook built an AR prototype backed by deep compression that detects both what an object is and where it is, for example a coffee cup, so that visual effects can be rendered accordingly; deep compression reduced the model size by 8x, which matters because Facebook wants such apps to run locally on the phone rather than rely on the network, and it appeared at the Facebook developer conference this year.

Deep compression is powerful, but it creates a lot of challenges for hardware: how to handle the compressed format online while running inference; the computation becomes irregular because of the sparsity in the weights and in the activations; how to exploit sparse activations; how to deal with the indirect codebook lookups; and parallelization becomes harder because of synchronization overhead, load imbalance and scalability. Having opened up the box on the algorithm side, we now move to the hardware side: what is the right architecture specialized for such compressed deep neural networks? I am going to introduce EIE, the Efficient Inference Engine, which runs inference directly on the compressed model.

There is related work on accelerating deep neural networks, and it shares a common goal: minimize memory access, which is where the energy goes. Eyeriss from MIT proposed a row-stationary dataflow to minimize DRAM traffic; DaDianNao from the Chinese Academy of Sciences uses eDRAM to buffer all the weights on chip; and the TPU from Google uses 8-bit integers instead of 32-bit floating point to reduce the cost of inference. In particular, the TPU paper notes that the unit is designed for dense matrices and that architectural support for sparsity was omitted but will have high priority in future designs. EIE is the first deep neural network accelerator that supports a sparse, compressed model directly, and its design motto is simple: anything multiplied by zero is zero, so do not waste the cycle and do not waste the storage; and neural networks tolerate approximation, so the computation does not have to be exact. Exploiting the 90% sparsity in the weights gives a 10x saving in computation and a 5x saving in memory footprint, the factor-of-two gap coming from the index overhead; exploiting the dynamic sparsity in the activations saves about another 3x in computation; and weight sharing, 4 bits per weight, saves another 8x in memory footprint. The good news is that after all this compression the weights fit in on-chip SRAM, which has much higher bandwidth and costs far less energy than going out to DRAM.

Let's take a deeper look at how EIE parallelizes the work. We partition the weight matrix by rows: rows of the same color go to the same processing element, and in this example there are four PEs. Physically, each PE stores only its nonzero weights in compressed form, together with a 4-bit relative index: 2^4 = 16 can encode a jump of up to 16 positions, which aligns well with a density of about 10%, and if the jump is larger than that we insert a padding zero. We also keep column pointers so we know the starting address of each column. The dataflow works on the activations: a leading nonzero detection circuit skips the zero activations, and each nonzero activation is broadcast to the processing elements, which multiply, accumulate and update their outputs; in the next cycle the next nonzero activation is broadcast, and so on; when everything is done the result passes through the ReLU and is written back to the registers.

This is the microarchitecture of an EIE processing element. It looks complex, but it breaks down into pipeline stages. First there is an activation queue, which decouples local load imbalance: one PE may have more nonzeros for the current activation and another fewer, so the queue keeps them all busy; this handles the local imbalance, and I will come back to the global imbalance later. Next we read the column-pointer SRAM to find where the column of weights starts, then the sparse-matrix SRAM to fetch the nonzero entries; each entry holds a 4-bit codebook index, which is decoded into the real weight through a 16-entry table, while the relative index goes through an address accumulator; the decode and the address accumulation happen in parallel. Then comes the arithmetic, a multiply and an add, and finally the result is written back to the register file or passed through the ReLU and the leading nonzero detection unit. Compared with a conventional design, the special parts are the queue that decouples load imbalance, which raises utilization from about 50% to more than 90%, and the weight decoder and address accumulator working in parallel to handle the sparsity and the level of indirection. We did the layout, placement and routing, and synthesis in a 45-nanometer process; this is the layout of the chip.
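Here is a minimal software model of the compressed storage format just described: nonzero values, a small relative index per entry with zero padding when the gap is too large, and column pointers. It is only a sketch; the chip packs 4-bit codebook indices rather than the float values used here.

import numpy as np

def encode_column(col, max_jump=16):
    """Keep only nonzeros; store each with the number of zeros skipped since the
    previous stored entry. If the gap cannot fit in 4 bits, emit a padding zero."""
    values, rel_idx, gap = [], [], 0
    for v in col:
        if v == 0:
            gap += 1
            if gap == max_jump:                  # largest gap a 4-bit index can span
                values.append(0.0)               # padding entry keeps decoding in sync
                rel_idx.append(max_jump - 1)
                gap = 0
        else:
            values.append(float(v))
            rel_idx.append(gap)
            gap = 0
    return values, rel_idx

def encode_matrix(W, max_jump=16):
    """Column-by-column encoding plus pointers to where each column starts."""
    values, rel_idx, col_ptr = [], [], [0]
    for j in range(W.shape[1]):
        v, r = encode_column(W[:, j], max_jump)
        values.extend(v)
        rel_idx.extend(r)
        col_ptr.append(len(values))
    return np.array(values), np.array(rel_idx, dtype=np.int64), np.array(col_ptr)

# Decoding rule: an entry's row = previous row + rel_idx + 1, starting from -1.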
The chip holds the compressed weights of these models entirely on chip; the area is about 40 square millimeters and the power is roughly 0.6 watts, 600 milliwatts. For the benchmarks I used the fully connected layers of AlexNet and VGG and the NeuralTalk LSTM. Even before any hardware, compression alone reduces the number of operations by 10x to 100x; NeuralTalk has less redundancy because it is a tiny LSTM, so its factor is smaller, but the saving is still large. On top of that, EIE is about 189x faster on these fully connected layers than a CPU and around 13x faster than a GPU, and faster still than a mobile GPU. Because the memory references move from DRAM to on-chip SRAM, the energy numbers are huge: roughly 24,000x more energy efficient than the CPU and about 3,400x more than the GPU running the uncompressed model. And it is not automatically the case that an ASIC beats CPUs and GPUs: comparing against prior accelerator designs projected to the same 28-nanometer technology, EIE has an order of magnitude better throughput than the peer designs, and projected to 45 nanometers it is orders of magnitude better in energy efficiency as well.

I alluded to the load imbalance problem earlier. The queue that covers local load imbalance raises utilization from about 50 percent to more than 90 percent on these benchmarks, except for the NeuralTalk LSTM, where utilization stays low, around sixty percent. So the next questions are: how can we do better with load imbalance, and how do we generalize from feed-forward layers to recurrent neural networks? That leads to the next design, ESE, the Efficient Speech Recognition Engine, for sparse LSTMs on an FPGA. Why accelerate LSTMs? Recurrent networks are the basic building block of speech recognition, image captioning, machine translation and visual question answering, and for speech recognition in particular, if everyone used Google voice search for just three minutes a day, Google would have to double its data centers, so it is a very important application.

Previously we compressed the model and then built hardware for inference. Now, to deal with load imbalance, ESE adds a feedback loop from the hardware back to the pruning algorithm: load-balance-aware pruning. Before defining it, let me show where the load imbalance comes from. In this weight matrix, PE0 holds five nonzero elements while PE3 holds only one; the total number of cycles is determined by the PE with the most nonzeros, five cycles in this case. If we instead impose the constraint during pruning that every PE ends up with the same number of nonzero elements, the workload is balanced, three nonzeros each, and the overall computation takes only three cycles.
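A toy model of that cycle count, assuming rows are interleaved across processing elements as in EIE; the matrix, density and PE count are made up for illustration.

import numpy as np

def layer_cycles(W, n_pe=4):
    """Each PE handles every n_pe-th row and spends one cycle per nonzero,
    so the layer time is set by the busiest PE."""
    per_pe = [np.count_nonzero(W[pe::n_pe, :]) for pe in range(n_pe)]
    return max(per_pe), per_pe

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8)) * (rng.random((16, 8)) < 0.1)   # ~10% density
cycles, per_pe = layer_cycles(W)
print(per_pe, "-> actual:", cycles, "cycles, balanced ideal:",
      int(np.ceil(sum(per_pe) / 4)), "cycles")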
How does this constraint affect the prediction accuracy? Originally, without the constraint, we could prune away about 90 percent of the parameters with no increase in error rate, and with load-balance-aware pruning we observe essentially the same accuracy at that sweet spot across a lot of experiments. What about the speedup? Without load-balance-aware pruning, at 90% sparsity the speedup should theoretically be 10x, but because of indexing and scheduling overhead we measured about 5.5x; with load-balance-aware pruning we measure 6.2x. Going from 5.5x to 6.2x is essentially a free lunch, obtained just by enforcing a load-balanced computation pattern. I will not go through the full hardware architecture of ESE; it essentially uses the EIE sparse matrix-vector unit as a building block and wires those units into the LSTM datapath. We prototyped it on a Xilinx FPGA: with a batch size of 32, which is realistic given the real-time requirement of speech recognition even though a GPU could batch much more, ESE runs the sparse LSTM faster than the GPU runs the dense model, with several times better energy efficiency, even though this is an FPGA prototype rather than an ASIC.

To recap so far: one challenge is that memory access is expensive, and I addressed it with deep compression, making the model an order of magnitude smaller or more; but that creates a second problem, sparsity, indirection and load imbalance, which I addressed with two hardware accelerator designs, EIE and ESE, that are both faster and more energy efficient.

Now let's switch gears from the hardware back to the training algorithm. So far everything has been about inference; what about training? Previously we compressed the model while keeping the same prediction accuracy, which raises the question of whether we can get higher accuracy out of the original model capacity. For the next ten minutes I will talk about a better training method that uses sparsity as a regularizer: DSD, dense-sparse-dense training for deep neural networks. We have already seen the first two steps: by pruning we obtain a sparse model with the same accuracy. The new step is that we then recover the previously pruned connections and train everything together. Those weights get a second chance, re-initialized from zero; they were pruned because they were probably not useful at that point, but reviving them and training the whole network again yields higher prediction accuracy. It is like growing the trunk of a tree first and then gradually adding the branches and the leaves.

We tested this idea on a wide range of vision networks, GoogleNet, VGGNet, ResNet-18 and ResNet-50; the error rate dropped by more than a percentage point, with relative improvements ranging from about 3% to 13%. To show it is a general-purpose method that works not only on convolutional networks but also on recurrent ones, we tested it across vision, speech and natural language: it improves the BLEU score for caption generation, and in a collaboration with Baidu on Deep Speech training the relative accuracy again improved, by about five to seven percent.
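Here is a minimal sketch of that dense-sparse-dense schedule; the gradient callback, learning rate, sparsity and step counts are placeholders, not the settings used in the experiments.

import numpy as np

def dsd_train(w, grad_fn, lr=0.01, sparsity=0.5, steps=(100, 100, 100)):
    """Dense -> Sparse -> Dense training; grad_fn(w) returns the loss gradient."""
    for _ in range(steps[0]):                     # dense phase: ordinary training
        w = w - lr * grad_fn(w)
    thresh = np.quantile(np.abs(w), sparsity)     # prune the small weights
    mask = (np.abs(w) > thresh).astype(w.dtype)
    for _ in range(steps[1]):                     # sparse phase: keep pruned at zero
        w = (w - lr * grad_fn(w)) * mask
    for _ in range(steps[2]):                     # dense phase: revive pruned weights
        w = w - lr * grad_fn(w)                   # (they restart from zero)
    return w

# Toy usage with a quadratic loss, so grad_fn(w) = 2 * w:
w_final = dsd_train(np.random.randn(8, 8), grad_fn=lambda w: 2 * w)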
There is some related work. DropConnect randomly zeroes out weights, whereas we prune away the small weights deterministically. Model distillation trains a small student model to learn from a larger teacher model and also yields a smaller model with good accuracy. Both DSD and distillation leave the architecture unchanged, and it would be interesting to see how the two methods could be combined.

To summarize, the presentation has covered three main points: model compression, which makes the model smaller with fewer operations and a smaller memory footprint; hardware acceleration, with the EIE and ESE accelerators; and efficient training, using sparsity as a regularizer for better accuracy. Projecting my work onto the matrix of inference versus training and algorithm versus hardware: on the inference-algorithm side, sparsity makes the model smaller, as in deep compression; on the inference-hardware side, EIE and ESE deliver better speed and energy efficiency; on the training-algorithm side, DSD targets higher prediction accuracy; and training hardware remains future work.

I hope such efficient methods and hardware techniques will help deploy deep neural networks and give devices intelligence, on phones, drones, robots and self-driving cars, and will also make it more efficient to run AI in the cloud. In the near future our cities will be full of smart applications, smart mobility, smart grids, smart buildings and big data, and hardware will be an enabling factor for that revolution, which demands low latency, user privacy, mobility and energy efficiency. Looking ahead, computation has moved from the PC era to the mobile-first era, and now we are moving into the AI-first era; I hope you are as excited as I am about this wave of computing research.

Finally, many thanks to my advisors, collaborators and friends. In particular I thank Bill for five years of advising. I learned a lot from Bill, for example that finding a problem is as important as solving a problem; during my first two years, finding the problem I was going to work on took a lot of time, and it eventually led to the high-impact work here because it came from a good direction and addressed a real-world problem. I also remember a conversation with Bill at GTC last year that encouraged me to pursue academia as my career, which turned out well, so thank you for pointing me in that direction. I am also thankful to Mark: I remember we first met before I joined Stanford, when he was department chair; he encouraged me to apply and welcomed me to this beautiful campus, and in the following years we had many discussions that pulled me back whenever I was drifting off course. In particular, for the compression work, when I had difficulty closing the last one percent of the accuracy gap, he suggested looking at the minimum and maximum of the weight range for the centroid initialization rather than initializing randomly, and that careful thinking really paid off for the project and the papers. I am also thankful to Fei-Fei, whose deep learning class I will actually be speaking to later; thank you for accepting me and encouraging me to embrace a brand-new direction, looking into deep neural networks:
the marriage of deep learning with hardware. She pioneered the marriage of deep learning with computer vision, and now it is the marriage of deep learning with hardware, which has had a big impact on my research path. I also remember a conversation in her office that greatly encouraged me to continue on the academic path and apply for faculty positions; thank you for all that encouragement and for partially adopting me into your group. Thanks also to Christos: it has been a long time since I took your architecture class, and interacting with you and your students taught me a great deal about computer architecture and helped my research; thank you as well for taking the time and being flexible in scheduling my defense. Thanks to all my friends for their support and encouragement throughout my PhD. I am incredibly grateful to everyone who has supported me along this long journey; my mom and dad are not here today, but they were the first to encourage me to find my passion in research, so thank you. I have spent five very happy years in this lab, with its academic activities and also the fun on the court and on the bicycle, hobbies partly inspired by Bill. Thank you all, my friends and my advisors, and thank you for coming today.

[Bill Dally] We have time for questions during the open session of the defense before we go to the closed session.

[Question about how regularization during training interacts with pruning] That is a good question; I have a backup slide for it that I left out of the talk to simplify the discussion, but it is in the paper. It compares L1 and L2 regularization: before retraining, L1 regularization performs better, since more weights are already near zero, but after retraining, L2 gives the better result, so I used L2 regularization throughout this work.

[Question about clock gating] Yes, there has been work on clock gating, for example the Eyeriss chip from MIT: whenever it observes a zero operand it does not fire the multiplier, since the result would be zero anyway, so it saves the energy of that multiplication, but it still spends the cycle. Clock gating saves energy but not computation cycles.

[Question about the metric behind the efficiency numbers] Efficiency here means both the speed and the energy it takes to run inference, so the energy comparisons, the 16x or 100x figures, are ratios of the energy consumed for the same workload.

Related to precision, let me also pull up a backup slide I prepared from NVIDIA's recent mixed-precision training work, presented at GTC just a month ago, which trains deep neural networks with 16-bit floating point. It is possible to run the forward pass with FP16 numbers and backpropagate with FP16 as well; the new ingredient is that the weight update is applied to a full-precision 32-bit floating-point master copy of the weights. With this mixed-precision training, for an Inception model the convergence is essentially the same as full FP32 training, the final converged accuracy differs only slightly, and using 16 bits rather than 32 roughly halves the storage and bandwidth for the values kept in FP16.
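A minimal NumPy sketch of that mixed-precision recipe for a single linear layer with a squared loss; it is not NVIDIA's implementation, and the loss scaling used in practice to avoid FP16 underflow is omitted.

import numpy as np

def mixed_precision_step(master_w, x, target, lr=0.01):
    """Forward and backward in float16, weight update on a float32 master copy."""
    w16 = master_w.astype(np.float16)            # FP16 copy used for compute
    x16 = x.astype(np.float16)
    y = x16 @ w16                                # FP16 forward pass
    err = y - target.astype(np.float16)
    grad16 = x16.T @ err                         # FP16 backward pass
    return master_w - lr * grad16.astype(np.float32)   # FP32 update

w = np.random.randn(32, 8).astype(np.float32)    # FP32 master weights
x, t = np.random.randn(4, 32), np.random.randn(4, 8)
w = mixed_precision_step(w, x, t)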
One more backup slide: with a student I mentored, we also tried the idea of ternary weights, only three values per weight, a positive value, zero, and a negative value. The validation accuracy essentially matches the full-precision baseline even though the network keeps only those three weight values.

[Bill Dally] OK, let's bring the open session to a close. Thanks, everybody, for coming; we will take a five-minute break before the committee moves to the closed session.
Info
Channel: Song Han
Views: 202,751
Rating: 4.8686051 out of 5
Keywords: PhD thesis, PhD defense, AI, deep learning
Id: EKZbdh6xia8
Length: 55min 45sec (3345 seconds)
Published: Fri Dec 01 2017