tinyML Talks - Sek Chai: Adaptive AI for a Smarter Edge

Captions
Ravi: Alright, it's 8 o'clock on the dot, so let's get started. Good morning everybody, and good afternoon or good evening, whichever part of the world you're calling in from; we're happy to have you. Today we have our second tinyML Talks. I hope all of you are staying safe and healthy during this lockdown. Today Sek Chai from Latent AI will talk about adaptive AI for a smarter edge.

If any of you or your organizations are interested in sponsoring these tinyML Talks, sponsorships are now available; contact Barry at tinyml.org for more information, and we can have different sponsors mentioned here. A major milestone we achieved in the past week: our Bay Area meetup has crossed 1,000 members, and we'd like to congratulate Ravi, co-founder of Galleon, who is our 1,000th member. If you'd like to join our Bay Area meetup, or any of the several meetups that have opened up across the world, please go to meetup.com at the link mentioned below and join the one that is geographically closest to you. Be sure to tune in two weeks from now, on April 28th, when we have two talks, by Professor Song Han from MIT and by Alexander Aroma from OctoNeon. It starts at 8 o'clock, and each presentation is approximately 30 minutes. Please contact talks@tinyml.org if you're interested in presenting at one of the upcoming sessions.

As a reminder, as I already mentioned in the chat window, please use the Q&A panel to ask your questions. As I moderate I will be looking at the Q&A window for questions to relay; if you put a question in the chat window it may get missed. The slides and video recording will be available at tinyml.org/forums tomorrow.

A brief introduction: Sek Chai is CTO and co-founder at Latent AI. In previous roles Sek was a principal investigator for multiple DARPA and DoD projects at SRI International, and he also held senior technical positions at Motorola Labs. He received his PhD from Georgia Tech. Sek has spent most of his career evangelizing efficient computing for embedded vision. Welcome, Sek, go ahead.

Sek: Thank you, Ravi. Hello everybody, thanks for joining in. This talk is the second talk in the tinyML Talks webcast, so I'm very appreciative; last week we had Pete Warden talk about how to get started in tinyML. In this talk I'm going to cover some areas more on the software side, to see how we can enable some of these tinyML solutions. It's titled "Adaptive AI for a Smarter Edge," and I will cover different topics with respect to optimization, quantization specifically, and other things that lead to what I think are very important use cases towards the end.

First of all, before we start, big thanks to the tinyML community. Latent AI is a proud sponsor of tinyML; we've seen it get grassroots support, growing from scratch to where it is today, and we're proud to sponsor it and hope it grows even bigger. Special thanks to the tinyML organizers, Ira, Olga, Evgeni, Ravi, all those folks; there's a lot of work behind the scenes to get this going, and at a time when we're all in lockdown it's very important to keep everything moving, so thanks for that. And thanks to all of you for attending; it's important to stay engaged in these scenarios, and hopefully you'll learn a little bit from what we're presenting in these tinyML Talks, and hopefully you'll get to participate and volunteer to present, as Ravi mentioned.
Before we get started I'd like to cover a little bit about who we are. Latent AI is an early-stage startup; we were founded in December 2018, so we are a little over a year old, not too new, but far enough along to have things in the works. We are an SRI startup, a spin-off from SRI International, and we are backed by a lot of DARPA technologies that I worked on while at SRI. Part of my work at SRI became the technology we've built on; we've moved it beyond the basic science done with DARPA funding and into industry, hopefully to enable other folks to use some of these technologies. Latent AI is VC funded; our seed round was led by Future Ventures, and we're happy to have their support and that they recognize the impact we can provide with the technology. The bottom line is that Latent AI develops core machine learning technologies that enable efficient, adaptive AI. We build tools for people to build tinyML solutions; we want to enable other folks to do what they need to do and to build the solutions that will matter in the future.

This is the team, and I'd like to give a shout-out to my core team here; everything you see in these slides was done by the team. I'm just the person presenting what they have achieved, and I'm very proud of them. The team is growing, but these are the core folks who have joined us and are taking this road trip with Latent AI.

Just to get started: tinyML. We've all seen these kinds of pictures of tinyML solutions, where things are on the edge and very constrained. There are reports, Gartner reports and the like, that predict tens of billions of these devices by 2025. They're everywhere, very pervasive, almost hidden in a way because they're so small. These tiny devices are connected to your sensors of choice, whether audio, video, or even vibration sensors. You put a smart processor next to your sensor and you can do a lot: detection, recognition, anomaly detection, all the things that matter. And there are going to be trillions of these inferences annually, devices just buzzing along, working to help us live better, live longer, and make things more usable.

The key point is that there will be a lot of these devices, and all of them will be working hard: trillions of inferences annually, if not more. So the question we want to get to is: if you have a lot of these devices, how do you make them efficient, really usable, with the lowest amount of memory, the lowest power, and so on? Efficiency at the edge is what we're all about. We could certainly put a big processor out there, but that's not what tinyML is about; it's about recognizing that there are ways to make these efficient, either algorithmically, in software, or in hardware.

What we want to talk about in this presentation is quantization, looking at one specific resource, which is memory. Memory for a lot of these neural networks is known to be the dominant cost: there are a lot of parameters to configure when you train a network, which means a lot of memory that needs to be consumed.
The question then is how do you manage that: how do you deploy something while maintaining the accuracy you need and reducing the memory size?

So here is the agenda for this presentation. We'll cover a little bit of the algorithmic development that enables some of these tinyML solutions, quantization specifically, as an algorithmic approach to efficiency at the edge. Remember that there are other techniques out there, including pruning and neural architecture search; we'll focus only on quantization in this talk, within the time we have. We'll show a little bit of what we can achieve with Latent AI tools, but also give context on state-of-the-art results. And at the end I'd like to highlight some important use cases and give a call to action: given that you can make things efficient and deploy some of these neural networks with a small memory footprint, what can be enabled?

So this talk gives you a highlight, almost a quick preview, of what is possible. We're not going to go too deep into the actual algorithms, because they're complex and mathematically involved; the goal here is a quick preview and an understanding of what can be achieved and, given that, what kinds of solutions can be built.

Ravi: Hi, this is Ravi. I just wanted to remind the audience to ask any questions in the Q&A panel, and feel free to ask them in the middle of the talk; we want to keep this as interactive as possible. So if there are any questions about what Sek has covered so far, chime in.

Sek: Yes, certainly, thanks for that. Any questions, please post them in the Q&A panel; Ravi will be monitoring it, asking me questions, and also asking questions to help clarify things. It's important to note that quantization itself, in its premise, is easy, but there are a lot of ways to get good quantization results, good ways to get the solutions that you want, so it can get complicated downstream. I'm doing my best to gloss over the topic while still giving you a feel for what it means and what you can achieve.

So, as an introduction to quantization, this slide shows what the algorithm is really about, in the post-training setting. Post-training means you don't have to retrain your neural network: given an already trained network, what would you do to reduce its memory footprint? If you take a neural network and look at any particular layer, there is a tensor that describes, say, the parameters, the values you arrived at during training. If you look at the histogram of the values in that tensor and plot it, you get the top row: it's in the floating-point range, with values perhaps centered around zero, but the floating-point axis has a long range from one end to the other, and it's quite large. The goal of quantization is to map that set of values to something smaller, maybe something representable with an 8-bit integer, which we call int8.
We want to find a transformation from the floating-point axis to the integer axis; it's a mapping from one axis to the other. The reason you do that is that floating point may have the wide range, but a lot of your hardware may be doing int8 operations, byte-level operations, and you want to be able to squeeze all your workload and all your math functions into byte-level computation. Because you're in a way compressing from one domain to another, you may lose some accuracy; you may lose some precision in describing the shape of the floating-point distribution once it's in the byte-level regime. So there may be some approximation in your mapping, but you want to do a good enough job that it doesn't affect the overall network computation, so that you maintain your accuracy when you run inference at the end. That is the post-training view of quantization. In a subsequent slide I'll talk a little more about what the algorithms look like, the different ways you can map floating point to integer, but the general premise of quantization is looking at values and making a best guess about where they should be mapped from one space to another.
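To make that mapping concrete, here is a minimal numpy sketch of a symmetric, per-tensor, post-training int8 quantizer. The toy weight tensor and the max-based scale are illustrative assumptions, not Latent AI's specific algorithm.

```python
import numpy as np

def quantize_symmetric_int8(w):
    """Map a float32 tensor onto int8 with a single symmetric scale."""
    scale = np.abs(w).max() / 127.0                      # widest value pinned to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map the int8 codes back to float to estimate the quantization error."""
    return q.astype(np.float32) * scale

# A toy "trained" weight tensor, roughly centered around zero.
w = (np.random.randn(64, 3, 3, 3) * 0.1).astype(np.float32)
q, scale = quantize_symmetric_int8(w)
w_hat = dequantize(q, scale)

print("scale:", scale)
print("mean abs error:", np.abs(w - w_hat).mean())       # the precision lost in the mapping
```

The stored model then keeps the int8 codes plus one scale per tensor, which is where the roughly 4x file-size reduction discussed later comes from.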
Ravi: A question came up: there are other ways to compress a model, including pruning. Should we prune first or quantize first; is there any preferred order?

Sek: I think there's no general best guideline. I suspect that depending on the data set and depending on the model, one can go either way. A lot of folks prune first before quantizing, because a lot of folks who design neural networks, unless the network is really highly tuned, over-parameterize: you have a lot of nodes, a lot of layers, a lot of connectivity. As you do machine learning training, depending on how you trained it, you will find a lot of sparsity; you will find a lot of zeros in the weights. You could quantize first and find that those values are zero anyway, and then prune. The other way is to prune first and then quantize, because those are values you're not going to use anyway, and that helps the quantization algorithm: if you have one million parameters and pruning cuts them in half, the quantization algorithm is greatly simplified with respect to the number of values it has to quantize over.

Thank you. Moving forward with quantization, let's look at training-aware. What I showed previously was a post-training approach, where you don't have to retrain the network. This slide is about training-aware quantization, where you go in while you're doing machine learning training and quantize as you go: you train in floating point, and maybe at some iteration or epoch you quantize to the desired range, evaluate, and continue. That's the most basic algorithm; other folks, including ourselves, use other approaches to train the network and quantize along the way, for example approaches based on knowledge distillation, and many others in the published work and in the tools that we have. But the most basic one is: train, quantize, and evaluate as you go.

The general idea of training-aware quantization is that, now that you are training the neural network, you have an additional dimension, and that dimension is bit precision. Not only is your solution space searchable, so that you can use machine learning to find the answer that gives you the best accuracy, but you now have an extra axis: where do you want the solution to be in terms of accuracy, and also in terms of bit precision? You could, for example, train in 32-bit floating point, 16-bit, 8-bit, and below; some people, ourselves included, have even trained down to binary values, where the math becomes quite different. The actual solution space is not only the right values for the network you're training, but also the bit precision for each of those values. What the graph here shows is that the solution space is hyper-dimensional: you're still searching for an accurate solution, but with an extra dimension, precision, so your search covers a wider space because you have a target bit precision in mind.
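One common way to quantize as you go is fake quantization with a straight-through estimator: the forward pass sees low-precision weights, but gradients flow back as if nothing had been rounded. The PyTorch sketch below is a generic textbook version of that mechanism, not Latent AI's training-aware recipe, and the layer sizes and the 4-bit setting are made up for illustration.

```python
import torch
import torch.nn as nn

class FakeQuant(torch.autograd.Function):
    """Round to a simulated low-precision grid going forward,
    pass gradients straight through going backward."""
    @staticmethod
    def forward(ctx, x, bits=8):
        qmax = 2 ** (bits - 1) - 1
        scale = x.detach().abs().max() / qmax
        return torch.clamp(torch.round(x / scale), -qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None          # straight-through estimator

class QuantLinear(nn.Module):
    """A linear layer that trains against quantized copies of its weights."""
    def __init__(self, in_f, out_f, bits=8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.05)
        self.bias = nn.Parameter(torch.zeros(out_f))
        self.bits = bits

    def forward(self, x):
        w_q = FakeQuant.apply(self.weight, self.bits)
        return x @ w_q.t() + self.bias

# Tiny training loop on random data, just to show the mechanics.
model = nn.Sequential(QuantLinear(16, 32, bits=4), nn.ReLU(), QuantLinear(32, 10, bits=4))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(128, 16), torch.randint(0, 10, (128,))
for _ in range(20):
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
print("final loss:", loss.item())
```

Because the rounding happens inside the training loop, the optimizer keeps searching for weights that still work after quantization, which is the extra bit-precision dimension being described.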
So on the next slide, on quantization approaches, I want to give a quick overview of approaches people have looked into, starting with the most basic ones, symmetric and asymmetric, and then a few others. These are terms people use in the quantization realm, and they are algorithms for mapping the floating-point values, the top bar, onto the integer values, the bottom bar. You might have some values in the floating-point domain, with some maximum range and values in the middle, and the objective of, say, symmetric quantization is to map them onto an integer space, the bottom graph. In the symmetric approach everything is centered around zero, or forced to be centered around zero. The reason is that a lot of the computation, convolutions for example, is centered around zero, because the filter weights are in a sense bimodal and spread around zero, so it's a natural fit. Some hardware also gives zero special meaning, so it can be more optimal for values to be centered around zero in terms of how the hardware computes. So a symmetric quantization algorithm sometimes makes sense. The asymmetric approach is a similar idea: you have values in the floating-point domain and you want to map them to the integer space, but now you're no longer centered around zero; you can have any offset from zero. This can make sense for other types of layers in your network, because it lets you cover ranges that are not centered around zero.

Then there are other approaches, like logarithmic quantization, which is an interesting one. You may again have values in the floating-point domain spread over a min-to-max range, but you map them into the integer space logarithmically. Think of it almost like a fisheye lens: things in the center are less distorted and map fairly directly, while anything toward the edge has a curvature to it. You're trying to cover more range than you otherwise could, down into power-of-two bins. This gives you good fidelity around zero, because it's centered around zero, and anything outside can still reach out to the range you want. So again, another mapping approach; I'm using it as a contrast to symmetric and asymmetric to show that there are different ways of mapping from one space to another. Beyond these there are many more in the academic literature, some in production as well, including channel-based quantization, mixed precision, and others. The trick in doing this work is to know what can be used, when to use each approach, and which one makes sense. Within the context of this presentation, I just want to make you aware that when we say post-training quantization, for example, you could take one of these algorithms, but don't be satisfied with it, because there may be other approaches that get you a better solution.
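To see how the symmetric, asymmetric, and power-of-two mappings just described actually differ, here is a small numpy sketch of the three schemes. These are textbook formulations with illustrative bit widths, not the exact variants in any particular tool.

```python
import numpy as np

def symmetric_int8(x):
    # Zero maps to zero; one scale covers the whole +/- range.
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127), scale

def asymmetric_uint8(x):
    # The [min, max] range maps onto [0, 255] using a zero-point offset.
    scale = (x.max() - x.min()) / 255.0
    zero_point = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, 255)
    return q, scale, zero_point

def power_of_two(x, bits=4):
    # Keep only the sign and the nearest power-of-two magnitude: fine bins
    # near zero, coarse bins at the fringes (the "fisheye lens" effect).
    # Returns the reconstructed values rather than packed codes.
    sign = np.sign(x)
    mag = np.abs(x)
    exp = np.clip(np.round(np.log2(np.maximum(mag, 1e-12))), -(2 ** (bits - 1)), 0)
    return sign * (2.0 ** exp) * (mag > 1e-12)

x = (np.random.randn(10000) * 0.2).astype(np.float32)
q_sym, s_sym = symmetric_int8(x)
q_asym, s_asym, zp = asymmetric_uint8(x)
print("symmetric scale:", s_sym)
print("asymmetric scale and zero point:", s_asym, zp)
print("power-of-two reconstruction of first values:", power_of_two(x)[:5])
```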
The most important thing I want to convey to the audience about which approach is best: a lot of it depends on your network of choice and the data that you have. If the data set you want to train on is very simple, meaning you can easily partition it, then just about any quantization algorithm will do well. But if you have a data set that is hard to partition, with a lot of mixed signals that are difficult to separate, then you might need a heavy-duty quantization algorithm. So there is no single answer; it really depends on your data set. One quick example: MNIST, which is handwritten digits, or CIFAR-type data, are small, simple data sets, and very simple asymmetric algorithms work well on them. When you get to ImageNet, with 1,000 classes, lots of data, and visually similar feature sets (a cat and a dog look very similar feature-wise, for example), it's harder, and if you're doing post-training quantization you need more complex, more capable algorithms to reach a good solution.

So with respect to post-training results, let's transition from the mathematical description of what the algorithm does to some results, and let's just pick symmetric for now. What you see here is a simple set of models, Inception, VGG, and ResNet; we give a baseline in floating point and compare it with int8 quantization. If you look at file size alone, you see a 4x compression: values represented in 32-bit floating point need 32 bits each, with a lot of range you don't actually need, and you can convert them, map them using the quantization algorithms I described, and get to int8 space. Looking at file size alone you get a 4x reduction, because int8 is one byte versus four bytes for floating point, so you get an automatic 4x compression. What matters next is the accuracy. You do the mapping, which as I said is really your best approximation, and the results are pretty good: you lose maybe two to three percent of accuracy, and in some cases you may not lose any. For VGG16 we may not have pushed the quantization hard enough, and for that particular example the result is not so good. The point is that, yes, this works: post-training quantization can get you the compression you want and can also preserve the accuracy, but you have to be smart about picking the right algorithm and tuning from there. These are results produced with Latent AI tools, running these networks both in the machine learning framework and natively on hardware. There are also other tools, from Google, from PyTorch, and so on, that provide quantization approaches.

Now let's look at post-training quantization results for MobileNet specifically, one particular class of models, rather than going across different networks. I picked MobileNet because it is notoriously difficult to quantize with asymmetric approaches; there are plenty of published evaluations showing that asymmetric doesn't work well for MobileNet, and we agree. In the left table, for MobileNet V1, if you use the fourth row down, asymmetric int8, you get very poor top-1 and top-5 results. It's poor not because the algorithm is wrong but because of the network: the way this network uses its values, the way it was hand-tuned and created, makes it hard for this algorithm. You can use other post-training quantization algorithms, like the per-channel approach, and you get the accuracy back, because you have a different way of mapping the floating-point values into the integer space. If you look at the third row, per-channel int8, it's very close to the floating-point baseline, a negligible one to two percent difference in accuracy, so a pretty good result. If you look at inferences per second, though, per-channel may be a little slower than the asymmetric result, so there is a trade-off between the accuracy you get and the inference speed. That being said, selecting the right algorithm is going to be important for the model and the data set.
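The per-channel idea mentioned above simply means one scale per output channel instead of one scale for the whole tensor, which is exactly what rescues layers whose channels have very different value ranges, the MobileNet situation. A rough numpy sketch follows, assuming a conv weight laid out as (out_channels, in_channels, kh, kw); the synthetic weights are chosen to exaggerate the effect.

```python
import numpy as np

def roundtrip_per_tensor(w):
    # Quantize to int8 with one scale, then dequantize to measure the damage.
    scale = np.abs(w).max() / 127.0
    return np.round(w / scale).clip(-127, 127) * scale

def roundtrip_per_channel(w):
    # One scale per output channel: channels with small weights keep fine resolution.
    scales = np.abs(w).reshape(w.shape[0], -1).max(axis=1) / 127.0
    scales = scales.reshape(-1, 1, 1, 1)
    return np.round(w / scales).clip(-127, 127) * scales

# Depthwise-style weights whose per-channel ranges span three orders of magnitude,
# which is where a single per-tensor scale starts crushing the small channels.
w = (np.random.randn(32, 1, 3, 3) * np.logspace(-3, 0, 32).reshape(-1, 1, 1, 1)).astype(np.float32)

err_tensor = np.abs(w - roundtrip_per_tensor(w)).mean()
err_channel = np.abs(w - roundtrip_per_channel(w)).mean()
print(f"per-tensor error {err_tensor:.6f}, per-channel error {err_channel:.6f}")
```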
We actually have other algorithms that we use, with which you can get the same accuracy as the baseline and also the inference speed you want. Even on MobileNet V2 the behavior is similar: asymmetric doesn't work well, per-channel works well but you give up inference speed, and with our approach we can bring both the accuracy and the inference speed back up. Looking at MobileNet V2 alone, we can match the speed of the asymmetric approach, and if you compare just the top-5 results, our approach holds the accuracy as well. The takeaway from this slide is that you really have to select the right algorithm; don't be satisfied with what you're given, because you can do much more with it, getting both the accuracy and the inference speed you want.

Ravi: A quick question. When you move from FP32 to int8, we generally expect the inferences per second to go higher, but in these tables it actually shows int8 being slower.

Sek: Right, that's a very good question, and a very good point I wanted to make. As Ravi said, the baseline floating-point inferences per second are quite high, and the int8 results are not as high; the mapping is not one to one, and you would expect that going to byte-level computation would make things faster. Well, a lot of it depends on the hardware. When you compare, say, inference latency or power, the result is very hardware dependent. In this case the format we use, NHWC (number of batches, height, width, channel), does not suit this particular hardware, which is an x86 CPU; a different arrangement of the data would do better, or put another way, if you have a GPU or another type of hardware that targets the NHWC data format, you get the higher speed. So it is very hardware dependent. What we want to show is the comparison relative to each other at the byte level: per-channel, asymmetric, and our approach are all in the same range. But going from the floating-point baseline to byte level, other effects come into play, including memory and the hardware itself. Another example: if you run on an x86 versus a Raspberry Pi and compare floating point against integer, bear in mind that the floating-point unit in an x86 is very different from the one in a Raspberry Pi. The capability and the memory infrastructure of the hardware matter when you compare floating point versus integer, because at that point you're really dealing with how you mapped it to the hardware, what formats you used, and what kind of accelerators you have. The message here is that, from the quantization perspective, you have to be smart about choosing the right algorithm, and then there is a second effect, which is picking the right hardware that matches how you want to store the formats.
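The data-format point is just about how the same tensor is laid out in memory. Here is a trivial numpy illustration of moving between NHWC and NCHW; which layout is faster depends entirely on the target hardware, which is part of why the int8 rows in these tables can come out slower than the FP32 baseline on one chip and faster on another.

```python
import numpy as np

# The same image tensor in the two common layouts.
x_nhwc = np.random.rand(1, 224, 224, 3).astype(np.float32)            # batch, height, width, channels
x_nchw = np.ascontiguousarray(np.transpose(x_nhwc, (0, 3, 1, 2)))     # batch, channels, height, width

print(x_nhwc.shape, "->", x_nchw.shape)
# Identical values, different memory order: CPU SIMD kernels, GPUs, and NPUs each
# tend to prefer one layout, so a quantized model stored in the "wrong" one can
# lose its speed advantage on that hardware.
```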
Okay, so we shift gears now to this slide, which talks about training-aware results. In this case we actually go through the process of training the network ourselves, and as you train you can quantize along the way; we use an approach based on knowledge distillation. We can take a neural network, train it, get a baseline result, and then decide whether we want to quantize to int8, maybe with an asymmetric approach, or even with a powers-of-two approach. Bear in mind that the results shown here are mAP scores, not classification accuracy: mean average precision with respect to the bounding box around the object of interest. For a bicycle, for example, you want to locate where it is in the image, so the scoring metric is different. The data set here is Pascal VOC and the resolution is 224 by 224. The best in class I've seen is about 0.7 for MobileNet V2 SSD, and folks may correct me on the exact number; I'd also qualify that there are published results at different resolutions, some at 512 by 512 or 300 by 300, and this is 224 by 224. Higher-resolution images tend to score a little higher simply because you have more context and more features to pull out. But within that context, I think these are state-of-the-art results. What you see is the baseline at 0.634 in 32-bit floating point, about 0.622 with asymmetric quantization, and about 0.601 with powers of two encoded at only four bits. That's a really good result down at the 4-bit level using a powers-of-two approach, and it's afforded by training-aware quantization. The way we do training-aware is to find the best solution for the bit size you want: if you're trying to hit four bits, you might try post-training quantization first, and if you're not satisfied with the result you can go training-aware, where you go through the machine learning process to find a solution, and that search includes the bit-precision dimension. These results were generated with the Latent AI tools; other tools and frameworks also provide some of these capabilities, but things like powers-of-two training are not necessarily readily available elsewhere.
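The knowledge-distillation idea mentioned for training-aware quantization usually boils down to a blended loss that pulls the quantized student network toward a full-precision teacher. The PyTorch sketch below is a generic version of that loss; the temperature and weighting are common textbook choices, not the specific recipe behind the mAP numbers above.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Blend the usual hard-label loss with a soft-label loss that matches the
    quantized student's output distribution to the full-precision teacher's."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: the teacher is the FP32 baseline, the student is the model being
# trained with quantization in the loop.
teacher_logits = torch.randn(8, 20)
student_logits = torch.randn(8, 20, requires_grad=True)
labels = torch.randint(0, 20, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
print("distillation loss:", loss.item())
```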
It's worth spending a little more time on this slide, because the detections are interesting. For the bicycle, the ground truth versus the floating-point detection is 97 percent, and int8 is a little higher at 98. How do you get more accuracy by quantizing? In fact you can: you're redistributing values, and some classes may come out naturally better, though you may trade off other places. Similarly, in the bottom example, with the ground truth, the FP32 result, and then int8, the floating-point detection picks up the potted plant at around 74 percent, and the int8 one misses that detection. Again, you're shifting the weights and distributions around to give the best approximation to what the floating-point network would do, as close as you can get to the ground truth.

So far we've shown quantization, both post-training and training-aware, and the results are pretty good; these are complicated problems, ImageNet with 1,000 classes, and we can train and compress down even to four bits and still get the results we want. But is there more? What we want to convey with this slide is that yes, there are more things you can do to get efficiency. Here I want to showcase a new technology we're building, where the neural network automatically adjusts to the performance it needs. What you see in the video is that the network can throttle itself up and down; the value shown represents utilization, and the network can throttle its utilization anywhere from about 30 percent to 100 percent, determining what it needs. In this video example the task is gesture recognition, say a left-to-right or an up-and-down gesture, and at the onset the network naturally runs at low utilization: nothing is happening, so you don't need to start off heavy. As it detects the onset of a gesture, it decides it needs more confidence in the result, because something is happening but it isn't recognizing it well yet, and it throttles itself up to the level it needs. So as the person makes, say, a left-to-right gesture, the throttle meter goes up, maybe to 80 or 90 percent, until the network is highly confident that the gesture is, say, a left-to-right swipe or a thumbs-up, and then it throttles back down, because the gesture is complete. What we're showcasing for the tinyML audience is that you can let the neural network decide at what performance level it needs to compute. If the application requires you to run at high utilization at a certain time, you should, but when it doesn't, you don't need to run that heavy. This is a way of letting the neural network decide, at runtime, where it needs to run, as opposed to training a network today and then having to compute all of its nodes every time, which is a much more static approach where you're bounded by the full amount of computation. Here you're training the network so that at runtime there is a decision you can make. In this gesture-recognition example you can throttle up and down because you leverage the time aspect of a gesture; in other cases, for example surveillance video where you're detecting human presence, much of the time nothing is happening in the scene, so you don't need to run a heavy workload, and you can throttle down and ramp up only when something happens. For tinyML solutions, that lets you extend your battery life much more than just running statically at a hundred percent all the time.
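One simple way to picture a network that throttles its own utilization is width gating: at run time only the first fraction of units in a layer is computed, and a small controller raises that fraction until the prediction is confident enough. The toy PyTorch sketch below illustrates that general concept only; the class names, the fixed utilization ladder, and the confidence threshold are invented for illustration, and a real system would also be trained to behave well at each width rather than simply masking an ordinary model.

```python
import torch
import torch.nn as nn

class ThrottleableMLP(nn.Module):
    """A model whose effective width is chosen at run time by masking units."""
    def __init__(self, in_f=64, hidden=256, classes=10):
        super().__init__()
        self.fc1 = nn.Linear(in_f, hidden)
        self.fc2 = nn.Linear(hidden, classes)
        self.hidden = hidden

    def forward(self, x, utilization=1.0):
        h = torch.relu(self.fc1(x))
        keep = max(1, int(self.hidden * utilization))
        mask = torch.zeros(self.hidden, device=h.device)
        mask[:keep] = 1.0                      # only the first `keep` units contribute
        return self.fc2(h * mask)

def run_with_throttle(model, x, target_confidence=0.9):
    # Start cheap; ramp utilization up only until the prediction is confident enough.
    for utilization in (0.3, 0.5, 0.8, 1.0):
        probs = torch.softmax(model(x, utilization), dim=1)
        conf, pred = probs.max(dim=1)
        if conf.item() >= target_confidence:
            break
    return pred.item(), conf.item(), utilization

model = ThrottleableMLP().eval()
with torch.no_grad():
    pred, conf, used = run_with_throttle(model, torch.randn(1, 64))
print(f"predicted class {pred} at {used:.0%} utilization (confidence {conf:.2f})")
```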
So this gives you another dimension to what the actual runtime can be: your system can throttle the devices up and down as needed, with minimal loss in accuracy. We think something like this could be a game-changer for a lot of tinyML solutions, because you're no longer bounded by what the network says you have to compute. If you're given a network with, say, a billion parameters, the usual approach says that at every inference you have to compute all of them across the network layers. What we're proposing is deciding at runtime that maybe you run at 50 percent utilization, computing only that fraction of the nodes, and being smart about how you compute it.

Given all of that as context, now that you can quantize a network post-training or training-aware, and there is new technology coming for dynamic control, let's talk about an example use case, because that's what matters: what can tinyML let you do? I picked this example as a starting point for discussion. Consider voice as a means for a contactless UI. In today's COVID-19 scenario, with lockdowns, nobody wants to touch anything because of contamination concerns, and voice becomes a very useful way of interacting with systems. Voice as a contactless UI is already in pervasive use, for example Siri and Alexa, so it's already coming; but let's make it smarter and always-on. There are privacy concerns about sending all your voice and sound data to the cloud, and tinyML solutions, because the processing is done at the edge, preserve that privacy: nothing leaves your device, it runs on premise, and no internet connectivity is needed while you're processing on the device. So a tinyML solution makes sense as a contactless UI. The picture shows a smart-home box, something like an Alexa or Google Home device, but it could be anywhere: think about an elevator where instead of pressing the button you speak, or a door lock; there are a lot of possibilities. Suffice it to say we've looked into these kinds of examples. We've done 30 wake-up words with very small networks and compressed them; we've shown around 5x compression with the kind of open-source audio recognition example that's out there, and we've built networks that are much smaller, showing roughly 10x compression down to a few bits. So this is very capable. What the table shows is a training-aware quantization approach, because we're really trying to optimize it: we get a baseline of about 77 percent for 30 wake-up words, and I've shown two levels of results. Using powers of two at six bits we get 77.1, so a negligible quantization effect, and when you get down to five bits you start to see a dip in accuracy. We suspect that as you go lower, to four bits and three bits, it will start to tail off and you'll really see the quantization effects, but that is still a pretty low bit count that we can get to.
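For a sense of scale, a wake-word model of the kind being described is typically a very small network over audio features. The PyTorch sketch below counts the parameters of a hypothetical 30-keyword classifier and what its weights would occupy at FP32 versus 6 bits; the architecture and sizes are illustrative and are not the model behind the accuracy numbers above.

```python
import torch
import torch.nn as nn

class TinyKWS(nn.Module):
    """Small conv net over a 40x49 MFCC-style spectrogram, 30 keyword classes."""
    def __init__(self, n_classes=30):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                        # x: (batch, 1, 40, 49)
        return self.classifier(self.features(x).flatten(1))

model = TinyKWS()
logits = model(torch.randn(1, 1, 40, 49))
params = sum(p.numel() for p in model.parameters())
print("output shape:", tuple(logits.shape))
print(f"{params} parameters")
print(f"~{params * 4 / 1024:.1f} KB of weights at FP32, ~{params * 6 / 8 / 1024:.1f} KB at 6 bits")
```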
And this is only the start of what we can explore; there's more to do in terms of the data sets you want to look at and the types of models you want. But the software tools, for post-training quantization, for training-aware quantization, and even some of the newer things coming, are starting to become available, and the use cases we see here will be very capable in terms of what these solutions can enable.

So I'd like to end with a call to action. TinyML solutions can be very purpose-driven and aligned to worthy causes. Think about the pandemic situation: we can employ a lot of these things even for recovery, recovering from lockdown, and for saving more lives out there; it's really a driver to make the world a better place. So I'm calling out to the audience, to folks looking into tinyML, to do good things with the technology that's becoming available. We're opening up a tinyML survey just to understand where folks are in developing some of these solutions, but really to start engaging the audience on what good tinyML solutions might be, to start the thought process within the community, and to get solutions deployed out there. The link is available as a QR code or as a bit.ly link, and it will also be available when the slides are posted, so you can look it up again. With that in mind, and since I'm sure there are a lot of questions coming up, thank you for your time. I hope you got a little bit out of the talk and a sense of the tools and capabilities that are coming, which are very capable of building solutions that are meaningful for everybody. For any questions, contact me; my email is here, so you know how to reach me. Thank you very much.

Ravi: Thank you, Sek. There are a lot of questions; I'll try to get to a few, but in the interest of time let's go to the wrap-up slides first and then come back to the questions shortly. Thank you, audience; you will now see a five-question poll pop up in front of you. Please take the poll, as it helps us keep improving tinyML Talks, and feel free to continue the conversation at tinyml.org/forums, where you can ask new questions we didn't have a chance to cover here. The slides and video will also be posted in the forums tomorrow. Just a reminder about the next talk on April 28th; the speakers and topics are here, and we're trying a two-speaker setting next time. So let me jump back into the questions, and Sek, feel free to move back to the slides relevant to the question you're answering. One of the most common questions, from multiple people, is about the throttling and utilization. One question is: how do you define utilization? And the other question is: how do you do the throttling? Is it with another network, is it by changing the quantization, or something else?

Sek: Okay, let me go back to that slide. There's a lot one could say about how this can be measured, metric-wise.
I'll start generically and say that a lot of folks have looked into performance-based metrics for qualifying how good a tinyML solution is: how small it is, how much memory it uses, how much power it uses. Those are still very static measures. The question is, when the system becomes dynamic, how do you really measure how well it does, when the way it runs is context dependent? A left-to-right gesture might produce a different utilization than an up-and-down gesture, so it becomes a new set of metrics: how well do you detect it? So I would say the metrics for how well a dynamic system works are not well established yet. As for utilization, the way we measure it is how many nodes within the network, how many of the neurons, need to be computed. When we say 50 percent utilization it really means that out of, say, one million parameters, we are actually computing half of them; half the network is active and the rest we don't worry about. We have a working term for this: you could call it dynamic pruning. When you prune statically, during the training phase, you actually cut off parts of the network because you don't need them anymore; you chop it down so it becomes small, and then you can quantize it and so on. Here you're doing dynamic pruning in the sense that you're not actually removing anything, but at runtime you decide which parts of the network you don't want to run and gate them off, and the remainder is what you compute. As long as you cover the right activation paths you get a good-enough answer, and that's what we're getting at: for a lot of use cases we can eke out more battery life and better latency by being smart about how we compute.

Ravi: Rather than necessarily chopping it off at training time. Great, thank you. Another question: are there any publications or references for this runtime scaling? If you have a chance to post them in the forum later, that would be great.

Sek: Definitely. There are a number of publications, including our own, and we are putting more out there. A lot of them use different terminology, which makes it hard: people have used throttling, gating, dynamic inference, a lot of different terms. I would even argue the papers are hard to find because people are so set in how inference is naturally computed today, which is running inference on everything, that when something really new comes along it takes a while before it gets picked up. So there are pockets of papers out there, and we will put something out as well; we would definitely like to share with other folks and open up this space.

Ravi: Thank you. Another set of questions relates to Latent AI and whether this adaptive AI can be done with other open-source tools, or whether what you're showing is something custom that needs your tools.

Sek: I'm not aware of any other tools that are publicly available, or available from industry, for this; we are making it available soon.
The thing about this is how you make it generic enough that it becomes commonplace. If you look at where things are today, machine learning training, say convolutional nets and standard feed-forward networks, is pretty standard; folks in the AI community know how to use them, and deployment is very standard. Quantization and training-aware quantization are starting to reach that point; there's still a lot of basic knowledge and science to get out there, but tools are becoming available. These dynamic techniques are just at the onset of being interesting, and figuring out how one deploys them, and making tools available for them, is the harder part. The key thing is not just that a network can be dynamic; it's that, as an application developer, you can start to rethink your solution space. At that point you might think about running not just one network but several, because you now have the ability to switch between one modality and another: as you lower the utilization of one network you can spin up another. It even makes sense when you have multi-modal sensors. Take radar versus lidar versus visible cameras in, say, an autonomous system: you have three different sensors, and you know from context which sensor should be dominant in which scenario. If it's raining, maybe radar makes more sense; in a different context you might weight things differently, and you can utilize your networks differently. What this dynamic capability gives you is the ability to make that decision at runtime, so it's context dependent. So not only are we developing these things and making them available, as you asked about tools, but we also want to push it from the application perspective: what should change in how you deploy your systems?

Ravi: A related question about Latent AI: what is the business model, and in particular what kind of hardware platforms do you develop this adaptive AI and throttling for?

Sek: Well, we're trying to be model agnostic; we support a number of different models, as shown in the results. We're also trying to be hardware agnostic; we support x86, Arm, and different platforms. For business discussions I'll defer to a one-on-one basis; contact us and we'll talk. But from a tool perspective we're trying to be agnostic and work with any hardware out there. We appreciate that within the tinyML community there are a lot of different projects and systems being deployed, all of them different, all of them with use cases that matter, and we want to support all of them.

Ravi: Okay, thank you. Another question about the audio wake-word results: do you have inference speeds for those results?

Sek: I have them, but they're not shown here. For very tiny wake-word networks it becomes very hardware dependent, down to which hardware really makes sense, which is why I didn't put those results here. We're working with a number of customers and folks who have solutions in this space, so I'm not readily able to show some of these results, but I'm happy to talk individually and see how we can help enable some of these solutions out there.
Ravi: Okay, in the interest of time: we have about fifteen other questions in the Q&A, and people may have time constraints, so I'm going to jump to the last part. For the other folks, please know I'll wrap your questions into the forum, and Sek will go through and try to answer them over the next couple of days. Thank you, Sek, very much for a great talk; it was very informative and we really liked the technical depth.

Sek: Thank you. I'll stay around for Q&A.

Ravi: A few reminders, because people are asking in the chat: the slides and video will be available tomorrow; please look at the forums at tinyml.org/forums, and we'll take any further questions over there. You mentioned you're available a little longer for questions, so here is one that's quite popular: can you comment on quantization using low-precision float formats, such as unum or posit, instead of int8?

Sek: I'm not fully familiar with those particular terminologies, but yes, different folks are looking at different formats. Int8 is one; that's a signed representation, so you have positive and negative. When you use asymmetric, for example, we go to uint8, unsigned, so you don't have to carry the sign bit and you get a little more range. Some folks are looking at even more flexible floating-point formats, to enable new hardware that can move between one format and another. I still see this as a bit of a Wild West right now with respect to storage formats and data precision; it's still a little ad hoc compared to other areas that are more mature. We have a lot of experience looking at different model formats, for example TFLite and other protobuf-based formats, and even those are not necessarily standardized per se. So I'll leave it at this: standards are being formed and people are looking into it, but I don't see anything right now that rules them all.

Ravi: Another common question: bytes are natural for a CPU or a microcontroller. Is it beneficial to go below eight bits? If you go to the extreme, binary, I know there is specialized hardware to do things like popcount, but if you go to five or six bits, is that beneficial on a general microprocessor, or do you need custom hardware?

Sek: That's a very good question. Byte-level compute is where a lot of folks are, and that's a very natural boundary given the kinds of code we write, our natural programming constructs, and even the multimedia data we work with: byte level works, and works well. I'm starting to see a lot of activity going to four bits, because you can train some of these networks for very interesting applications at four bits, at nibble level, and still get a very tiny solution. There's a lot of discussion about why four and why not five, about where you draw the line.
For a lot of microprocessors out there you end up choosing power-of-two levels, so eight bits or four bits make sense, and binary is of course its own different case. But if some of these solutions, some of these newer accelerators, have an FPGA or some sort of reconfigurable fabric, then every single bit you carve out becomes resources for other useful things. So from our perspective we don't necessarily have to stop at eight bits; we can go below, and that opens things up. I'd also point out that a lot of this quantization work is really a design-exploration approach. If you have a data set, you can certainly train on it and get to eight bits, but the engineer in us wants to know the limits: you push the boundary to see how far you can go before accuracy actually drops off. If you go down to, say, six bits and the network can sustain its accuracy at six bits while you compute at eight, that tells you how much headroom you have, for example for adding more classes, in that gap between six and eight bits. So it's actually a tool that lets you understand, from a data-science perspective, what the limits are, and explore them.

Ravi: Absolutely, that makes sense. In the slides you showed, I think you talked about the weights being quantized; are the activations still full precision, or are they quantized as well?

Sek: In the examples, everything is quantized. In the first slide, where I showed post-training results on Inception, VGG, and ResNet, everything is quantized, all the activations, and we're running at int8. And in the other slide, where we had the MobileNet inference speeds, everything is running at int8; that's why we were able to show the difference in performance in terms of inference speed, and as I mentioned, at that point it's hardware dependent in terms of the data format. So yes, we quantize everything.

Ravi: Excellent. We're five minutes over the hour, so I think we'll take the remaining questions to the forum. Let's thank our speaker. Sek, again, thank you very much.

Sek: Thank you, and thank you to all the audience. All right, thank you.
Info
Channel: tinyML
Views: 1,856
Id: RWG0Pga0xbo
Length: 65min 38sec (3938 seconds)
Published: Tue Apr 14 2020