So I came up with the idea for this wearable
keyboard. I'd put on a data glove, draw some letters in midair, and it would type them
out over Bluetooth into my wearable computer. I built the hardware, I wrote a training app,
collected a whole bunch of training data, and I TensorFlowed the hell out of it.
54,000 weights, 600 neurons, 15,000 samples, and 500 epochs later, it worked pretty well!
...on my 60-pound liquid-cooled gaming beast. Here's the problem: the model is 75 megabytes of
data, but the glove only has one megabyte of RAM. The TensorFlow library is 400 megabytes, but the
glove program memory is only 2 megabytes. I had to cram this big neural network into this little
wearable device, and I am going to show you how. Sometimes the cloud is too far away. The
coder is willing, but the processor is weak. It is times like this when you must
do big brain think on little brain. [Bonk!] We gotta bring our machine
learning... to the edge. [Ominous rumbling] I can feel the web developers'
veins bulge as they hammer away on their 60% ortholinear mechanical keyboards, keycaps clacking
against the sustainable bamboo support plate, as they froth at the mouth, spitting flecks of
cold brew against the glossy screens of their 2019 MacBook Pros, screaming, "Use the cloud! USE THE CLOUD!"
[Rumbling intensifies] But that zealotry conceals fear, a primal fear that pricks away at
your sanity in the dead of night... that the philosophy you built your whole life
around may not be as absolute as you think... that sometimes you can solve a problem...
without... using... the Internet. [KABLOOIE!] Californians may not know this, but there are
times when an Internet connection is unreliable or unavailable. The device might not have
the processing power or battery to handle Wi-Fi. A cell phone to connect with
through Bluetooth might not be available. I mean, even if a solid Internet connection is
there, that 50 milliseconds of round-trip latency could be a game breaker for some user
experiences. The fact of the matter is that some user experiences are
best done completely offline. [SLUUUUUURP!] 'Machine learning at the edge' is pretentious
startup-ese for doing AI inference on an embedded device. An embedded device in this context is
something that isn't a cell phone, a computer, or a server. These devices are often considered to
be too weak to run sophisticated machine-learning algorithms, but bridging that gap is a critical
part of advancing technology. Relying on managed cloud services for your toaster's business logic
is like carpet-bombing a mosquito. It makes a lot more sense to process the model so
you can run it locally on the device using the hardware available. Running that sophisticated
model right there on the hardware gets you better scalability, increased efficiency,
and just a superior, more responsive product. Machine learning at the edge works because
training a neural network is a pain in the ass, but running a neural network is not. In order
to train this neural network I had to collect 15,000 pieces of data and then run them all
through the network 500 ee-pocks... eh-picks... 500 times in order to calculate 54,000
weights. But in order to RUN the network, all I need to do is collect one sample, run it
through one time, and then scoop up the output. In a way, machine learning at the
edge is actually a bit of a misnomer, because the learning takes place ahead of
time and all that's being done at the edge is running the model. Once my glorious handmade
60-pound pixel cruncher is finished chooching, what remains is the machine learning model -
the thing that actually makes the decisions. The model was trained on 75 megabytes of data,
but the model itself is only, like, 600 kilobytes. So here's the plan: we're gonna start
with a neural network that we create or, y'know, rip off someone's Jupyter notebook,
as well as a big ol' pile of training data. We then run that through TensorFlow Regular,
right on the biggest, most majestic computer we can find. We take that finished model and then
we run it through the TensorFlow Lite converter. This, like, sort of minifies and optimizes the
model, which reduces its size and also lets it run more efficiently, but here's the cool part:
we're not going to run it with TensorFlow Lite. Instead, we're going to take that model
and integrate it into an Arduino sketch and run it using TensorFlow Lite for
Microcontrollers. This is the smallest and most efficient TensorFlow available -
it's only, like, 16 kilobytes and it lets us run the neural network on the very limited
processor and fit it into very limited memory. It introduces a lot of gotchas and restrictions,
but if we play by the rules, we'll be able to jam the network and its full functionality into the
glove and we'll have a neural network that we can wave around. As of August 2020, this is
bleeding-edge [REDACTED] and it's got serious limitations. Many activation functions
are broken or just straight missing; dense layers and convolutional layers work great,
but recurrent networks don't. Finally, in order to fit the model on the device and to process it in a
reasonable amount of time, it has to be quantized, which means converting those 32-bit floats that
you usually use as weights down to 8-bit integers. Dropping 24 bits of precision
seems like it'll take your precisely-trained neural network and kick it in the crotch, but
in reality, it's not that big of a deal. Your network should already ignore small variations in
input values, and even if it doesn't, you're not really getting 32 noise-free bits of every
sample. If you are, then you're probably working at a research lab or something, in which
case stop watching YouTube and get back to work, you slacker! You have a virus to sequence! NOPE! Otherwise, the sky's the limit. As
long as your microcontroller has enough memory to store the model and enough oomph
to do a few hundred thousand multiplications in a reasonable amount of time, the model will run
as well on small brain as it does on big brain. [Marching band music]
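Here's a quick sanity check in plain Python of both claims at once: that inference is nothing but multiply-accumulates, and that squashing weights to 8 bits barely moves the output. The layer sizes are made up (not the glove's actual network), and this is the simplest symmetric flavor of quantization; the real TensorFlow Lite scheme is affine, with a zero-point per tensor.

```python
import random

random.seed(1)

# A made-up dense layer: 8 inputs -> 4 neurons (toy sizes, not the glove's network).
IN, OUT = 8, 4
weights = [[random.uniform(-1, 1) for _ in range(IN)] for _ in range(OUT)]
x = [random.uniform(-1, 1) for _ in range(IN)]

# Quantize: map the weight range linearly onto the int8 range [-127, 127].
scale = max(abs(w) for row in weights for w in row) / 127.0
q_weights = [[round(w / scale) for w in row] for row in weights]

# Inference is just one multiply-accumulate per weight, both ways.
float_out = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
quant_out = [sum(qw * xi for qw, xi in zip(row, x)) * scale for row in q_weights]

macs = IN * OUT  # one multiply-accumulate per weight
worst = max(abs(f - q) for f, q in zip(float_out, quant_out))
print(f"{macs} multiply-accumulates, worst error from quantizing: {worst:.4f}")
```

Scale this up to the glove's 54,000 weights and an inference is still only 54,000-ish integer multiplies, which a modern microcontroller chews through without breaking a sweat.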
Let's choose our weapon. Any microcontroller with, like, a
megabyte each of flash memory and RAM should be enough to make this work. nRF52
boards, ESP8266 boards, Cortex-M processors, they all work great. Your regular-ass Arduinos
and PICs are probably not powerful enough. [Smashing sounds]
Modern, faster boards like the Arduino Nano 33 BLE are A-okay. Single-board computers like the Raspberry
Pi are powerful enough to use the full-strength TensorFlow Lite, and honestly, with something like
this, you could even just run full-on TensorFlow and make your life way easier. You will have
the overhead of an operating system, so your project might actually be better if you [shattering sounds]
switch to a leaner, more real-time microcontroller. This Teensy 4.0 is perfect, and even this Teensy 3.6 would do great. I
used the Teensy 4.0 in the glove project. Now that you've selected your microcontroller,
it's time to stuff a neural network in it! First, we load our Keras model, then
we instance a TFLiteConverter, which can perform optimizations such as making
it run faster or making the code smaller. Or we can do neither, because frankly
[laughing] this doesn't really work very well! This is a neat part of the process that's pretty
easy to [REDACTED] up. We need to provide a representative data sheet that contains
at least one instance of every label... [British robot voice] You mean a
representative data SET, not a data SHEET, you dumb bastard. Disliked, deleted
my comment, and unsubscribed. It also needs a distribution of input
values that's similar to real-life data. It's really easy to do using the library
scikit-learn - we feed our training data into it and we have it pull out a stratified sample, which
is a representative sample. The optimizer takes this and generates lookup tables and
scales input and output to make the most of those eight bits of resolution and to make
the most of our available memory. Finally, we run the optimizer, which does the thing
and gives us a TensorFlow Lite-ready model. WOO-HOO! Time to get it into the microcontroller. Setting
up a makefile to take this model and link it into our firmware is just a hell beyond hell,
so we're gonna hack it. We're gonna use the hexdump library, which dumps a binary to hex,
and then I'm just gonna format it into a C++ object declaration. We're gonna save this as a
header file and now we can just drag it and drop it into our firmware sketch.
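If you'd rather skip the hexdump-and-reformat step, something like this dependency-free sketch does the same job. The function name, the chunk width, and the `g_model` variable name are my placeholders, not anything from the video or from TensorFlow:

```python
def model_to_header(model_bytes: bytes, var_name: str = "g_model") -> str:
    """Format a binary blob as a C array declaration you can #include in a sketch."""
    # alignas(8) keeps the flatbuffer aligned in flash, which TFLite Micro likes.
    lines = [f"alignas(8) const unsigned char {var_name}[] = {{"]
    for i in range(0, len(model_bytes), 12):
        chunk = model_bytes[i:i + 12]
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    lines.append("};")
    lines.append(f"const unsigned int {var_name}_len = {len(model_bytes)};")
    return "\n".join(lines)

# Fake 'model' bytes just to show the output shape; really you'd read your .tflite file.
header = model_to_header(bytes(range(20)))
print(header)
```

In the real pipeline you'd feed it `open("model.tflite", "rb").read()`, save the result as a header file, and drop that into the sketch exactly as described above.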
What? It works! It's time to leave this big-boy IDE for grown-ups
and go into the IDE of chaos and deviltry... let's switch to Arduino. In Arduino, we want to install
the TensorFlow Lite for Microcontrollers library for Arduino. We don't want the pre-compiled one; if
you're using anything other than an Arduino-brand Arduino, you're gonna have a bad time.
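One convention worth nailing down before reading any of the sketch: TensorFlow Lite models take flattened input, so the glove's 50 (x, y) gesture points go in as a single 100-value array. A tiny Python sketch of that shape convention, with a synthetic curve standing in for a recorded gesture:

```python
# Synthetic stand-in for one recorded gesture: 50 (x, y) points along a curve.
gesture = [(i / 49, (i / 49) ** 2) for i in range(50)]

# Interleave the pairs: x0, y0, x1, y1, ... -- one flat 100-value array
# instead of 50 coordinate pairs, matching the model's input tensor.
flat = [coord for point in gesture for coord in point]

print(len(flat), flat[:4])
```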
Let's check out the code. My code is just copied-and-pasted from Google's examples,
and those examples are written in ABSOLUTELY RIGOROUS compliance with Google's
C++ style guide. It's pretty neat! Anyways, let's dive in. First step is to import
the model from that header file we just generated, then instancing an OpResolver. We can kinda
optimize RAM usage by only including functions we actually use, but I'm really lazy, so we're
doing all of 'em. Then, we need an interpreter, and we need some memory for it to work in. We need
to set up an arena, which is a pretty death-metal term for what's basically a pre-allocated
scratch buffer. There's no, like, hard-and-fast rule for how much memory to allocate, so I'm
gonna start with two kilobytes. We'll dial it up if we're getting buffer overflows, and dial
it down if we need the memory back for everything else. We get a pointer to our input tensor and a pointer to our output
tensor, and we're all revved up and ready to go. This is where the rubber meets the road. Every
loop, the code checks if I've finished doing a gesture, and if I have, it processes it into
a standard format. Pre-processing is really important in TensorFlow Lite. Anything you can
do to reduce the necessary complexity of the model will let you make the most of the limited
resources; we are still constrained on how much model we can actually fit in this thing. All
input to TensorFlow Lite models is flattened; in other words, instead of putting in 50 XY
coordinates, we run them in X, Y, X, Y, X, Y. All that's left is to perform our inference and
let the eldritch gods of cyberspace figure out what I just wrote in midair. And that's it! We just
performed sophisticated gesture recognition on a device that can't even buffer a five-megapixel
image. Compiling this monstrosity takes a while, it takes up, like, 20% of program memory and,
like, 80% of the RAM, and it takes, like, a good five minutes to crunch the first time, but
it's worth it. We can run an inference on this 600-megahertz processor in like 10 milliseconds,
which is AWESOME. I mean, that gives us another 10 milliseconds to faff about and still keep the
thing crisp and responsive. This handwriting-recognizing glove totally works, and if you're
interested in the hardware, or you just want to see it in action, I did a whole video on it and
you can check it out right here. Call to action. But what else can you do with this? Try using some
convolutional neural networks to analyze video in real time! Try some real-time image recognition
or audio recognition right there on the device itself! Capture false outcomes and add them to
your training set later! You can even add more memory, like physically add in more flash chips,
to store and work on bigger models. What's cool about TFLite for Microcontrollers is that it
runs the model from program memory... I think? It does.
Don't quote me on this. I will.
Be sure to check back on TensorFlow Lite and TensorFlow
Lite for Microcontrollers often, because they are under very frothy active development, and
new features could be added and new regressions could be introduced at any time. So the next
time you want to learn you some machines, do the big-brain play and put it in a small
package. Just don't cut yourself on that edge. Thanks so much for watching, and
double thanks for watching the whole thing! If you want to look
through my terrible source code, it's all on GitHub. Ravioli ravioli, links
in the descriptioli. I make videos about electronics and the crazy stuff you can do
with them, and if that revs your engine, feel free to give me a subscription and get
notified when the next video is up. Or you could roast my programmer's tan. I've been underground
in New York City lockdown for, like, five months. I know I'm translucent. Anyways, thanks a lot
for watching, and I'll see you in the future. [Narrator] It wasn't so long ago that communication was
a simple act, but the range of the human voice is limited. So, man's ingenuity found ways
to bridge distance. He invented writing... ...and typographical errors. a representative data sheet
that representative data sheet
representative datasheet You mother [REDACTED]er
Great work.