Using GPT-4o to train a 2,000,000x smaller model (that runs directly on device)

Captions
Hi, I'm Jan. In this video I want to talk about LLMs, or large language models, on the edge.

The latest generation of LLMs, like the recently released GPT-4o, are truly astonishing, because they understand the world in roughly the same way we humans do. These models are multimodal, so they have multiple senses: they understand not just text, but also images and audio. That means you can ask them questions about the real world, much like you would ask another human. For example, say I have a camera pointed at my factory floor. I can ask the LLM, "Hey, is there a person not wearing a hard hat standing really close to the machine?" It will say yes or no, and based on that I can trigger actions; that's my safety protocol in the factory. This is incredibly powerful: the model understands what it sees, and you can just use natural language to ask questions about what's happening. Best of all, it's zero-shot, so you don't need to train a model.

But this comes with downsides. Say you want to deploy this in your factory. These models are huge: hundreds of billions of parameters for state-of-the-art models. They're slow: because they're so big you need to run them somewhere in the cloud, and there's real latency involved. Even with fast models like GPT-4o, it takes a second or more to get an analysis of a single image, which means you can only get an update every second or so; that's too slow for a lot of use cases. And it's expensive: you need to send raw images to the cloud and pay an LLM provider like OpenAI to analyze them. If you do that constantly, 24/7, preferably at 10 frames per second or so, you're going to make the people operating these services very, very happy.

So ideally we want something with very similar capabilities, the ability to understand the world and answer questions asked in natural language, but running much closer to, or even on, the edge. I'd love to have a normal, simple, small camera that I point at the factory floor, with everything running locally on that camera. That gives me super low latency, because there's no round trip, and super low cost, because I don't pay for any cloud service. But how? These two things seem fundamentally incompatible; LLMs have hundreds of billions of parameters.

Here's an interesting approach: what if we use an LLM's understanding of the world to train a much smaller model? We distill some of the knowledge that's in that insanely huge model and use it to build a much smaller one. In all fairness, that's what we do at Edge Impulse: we bring AI to the edge, with 300,000+ projects already created on Edge Impulse by people who want to build AI applications that are small and run self-contained on devices.

So how would that look? Let's build a project and find out. Yesterday I walked through my house, as you can see here. I have two kids, a toddler and a baby, and as you can imagine there are toys all over the house. So I shot a video just walking through my living room, and I can ask ChatGPT questions about it; it's actually really good at that. Say I want a model that knows whether there's a children's toy in view, yes or no. I give it a picture and ask: "Is the main topic of this photo a children's toy? If so, only say 'yes'; if not, only say 'no'; if you're not sure, say 'unsure'." I phrase it that way because I want to parse the answer programmatically. And you can replace this with any other question about what's happening in the real world: is there a person on the factory floor, how many people are standing in line, and so on. It's just a question I want to ask the LLM.
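To make that concrete, here is a minimal sketch of what a single constrained question like this looks like against the OpenAI API. The exact code isn't shown in the video; the prompt wording follows the description above, and the file name is a hypothetical placeholder.

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Is the main topic of this photo a children's toy? "
    "If so, only say 'yes'. If not, only say 'no'. "
    "If you're not sure, say 'unsure'."
)

def ask_about_image(path: str) -> str:
    # Encode the frame as a base64 data URL, the format the vision API accepts
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    # The constrained prompt means the reply should be just yes/no/unsure
    return resp.choices[0].message.content.strip().lower()

print(ask_about_image("living_room_frame.jpg"))  # hypothetical file name
```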
Let's do another one, here on the piano. The LLM, ChatGPT, understands both the scene and my question, and answers correctly. But you can also see this took three to four seconds to generate the answer; that's very high latency. Still, we have a model that understands the real world and that we can ask questions, which is really powerful. So how can we make that smaller?

Let's start by going into an Edge Impulse project and uploading a video, this video actually. We don't have any labels; this data is just unlabeled, so I'll leave it unlabeled and upload it. Now I want to train a small model that can answer the same question as the big model. Yes, it's going to be much more constrained: it can only answer that single question, "is there a toy in frame, yes or no?" But it also means it can be orders of magnitude smaller.

We split this video into images, which leaves us with about 500 items, none of which have a label. What we're going to do is ask the LLM to provide the labels, and those labels become the input for training our smaller model. So let's go to data sources, add a new data source, and select the GPT-4o labeling block. I paste the same prompt I used earlier in the ChatGPT window, set it to disable any samples with the label "unsure" or "blurry" (we don't want those in our dataset), and label everything.

All right, that's done, and now I have data labeled by the LLM. I can click here: this is a bed, and it got the label "no", not a children's toy. One thing we do in the system prompt is ask GPT-4o to also provide a reason, which we add as metadata. Here: "the main topic of this picture is a bed, which is not a children's toy." Let's go find something that is a toy. Here we go, a "yes": "the main topic of this picture includes a large plush toy and a plush toy, both of which are suitable for a toddler." Perfect, we even get proper reasoning from the model.
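The GPT-4o labeling block does all of this inside Edge Impulse Studio, but the underlying loop is easy to picture. Here is a rough sketch of the equivalent logic, reusing ask_about_image() from the sketch above; the folder layout, frame rate, and ffmpeg command are my assumptions, not from the video.

```python
import json
from pathlib import Path

# First split the video into frames, e.g. (hypothetical names and rate):
#   ffmpeg -i walkthrough.mp4 -vf fps=2 frames/%04d.jpg

labels = {}
for frame in sorted(Path("frames").glob("*.jpg")):
    answer = ask_about_image(str(frame))  # from the sketch above
    if answer not in ("yes", "no"):
        # Mirror the labeling block's behavior: drop "unsure" (or any
        # unexpected reply) instead of letting it pollute the dataset
        continue
    labels[frame.name] = answer

with open("labels.json", "w") as f:
    json.dump(labels, f, indent=2)

print(f"kept {len(labels)} labeled frames")
```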
With 500 images it's relatively quick to just scroll through and spot-check, but if you have even more there are all kinds of interesting, AI-powered visualization tools in Edge Impulse. For example, the data explorer takes neural network embeddings from another model, here a MobileNet backbone, and uses them to cluster your images. You see whole clusters of "yes", and anything on the fringe, like orange dots and blue dots sitting close together, is worth investigating. Here's a "yes" that's far outside its normal cluster and very close to a blue cluster: these two items are more or less the same photo, probably a second apart, yet one is labeled "no" and the other "yes", and only a small part of the toy is in frame. In a case like this we might want to override the LLM, so I delete this item. That lets me clean up my dataset much faster than clicking through everything by hand. In general, though, this looks pretty decent: we see nice clusters, and because there's already a neural network behind this page, it probably means we'll get really nice separation when we train a smaller network.

So I've actually done that; for time's sake it's already trained here. I've taken these images as input, resized them to 224x224, and used a much smaller neural network, one from our NVIDIA TAO collection with a pre-trained backbone, so we're doing transfer learning onto this model. Instead of hundreds of billions of parameters, this one has 800,000 parameters; I don't know how many orders of magnitude smaller than a big LLM that is. It works because this is a very constrained problem: a single question we want answered. We trained that, and it runs at about 10 frames per second on a Raspberry Pi, all local, no cloud required anymore, and we show some metrics here.

Does it actually work, then? To find out, I shot a video on my phone again. A truly amazing thing in Edge Impulse is that if you want to test something really quickly, you can go to deployment and scan the QR code; we build a WebAssembly package and you can run it on your phone. If you're not ready to deploy on real hardware yet, that's a really quick way to try things out. So here's that model: I'm walking through my house; there's a toy, now there's no toy; I point it here, that's a toy, that's a toy, there's a toy on the floor there; and when I point it outside, none of that. This runs one inference in about 25 milliseconds, so roughly 40 frames per second, in the browser on my iPhone. We took the knowledge from the LLM and put it inside a really small model. It's really cool.

And we can go even smaller. This version uses a 224x224 input with a relatively large MobileNet backbone, but depending on the size of your problem you can scale down further. Here I resized to 96x96, even smaller, and used an even smaller backbone, which means I can run this on a microcontroller, something that is this size. This device can actually run the model we trained by distilling that knowledge from an LLM. We see a small accuracy drop, so it will get a few more things wrong, but we can run it at about 10 frames per second on this tiny Cortex-M microcontroller, in about 232 kilobytes of RAM. That is truly different from using these big LLMs. I have a little video of that as well: this is the debug feed from the camera running the model, and I'm just walking around. There's a toy; there's no toy. We're streaming this from the device at about 111 milliseconds per inference on a Cortex-M7, and once again the knowledge that was sitting in the big model has been transported into this small model really, really well. There's a toy; there's not. Incredibly powerful stuff, I think.
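The video uses Edge Impulse's NVIDIA TAO transfer-learning block for this step. Outside the Studio, a comparable recipe might look like the Keras sketch below; this is my approximation, not the exact architecture from the video, and the data layout is assumed to be class subfolders produced by the labeling step. It also includes the int8 quantization you would typically apply before targeting a Cortex-M.

```python
import tensorflow as tf

IMG = 224  # drop to 96 for the smaller microcontroller variant

# Labeled frames in class subfolders: data/yes/*.jpg, data/no/*.jpg (assumed)
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data", image_size=(IMG, IMG), batch_size=32)

# Pre-trained backbone, frozen; only the small classification head is trained
backbone = tf.keras.applications.MobileNetV2(
    input_shape=(IMG, IMG, 3), include_top=False,
    weights="imagenet", pooling="avg")
backbone.trainable = False

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNet expects [-1, 1]
    backbone,
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # yes / no
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=20)

# Full-integer quantization shrinks the model roughly 4x, which is what
# makes a few-hundred-KB RAM budget on a Cortex-M plausible
def rep_data():
    for images, _ in train_ds.take(50):
        yield [tf.cast(images, tf.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = rep_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
open("toy_detector.tflite", "wb").write(converter.convert())
```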
The latest generation of these multimodal LLMs is truly astonishing, but they're too big for a lot of use cases. Yet having that power, understanding the world the same way we do through vision and audio, and being able to ask questions in natural language, carried over into these small models is truly remarkable. What we believe at Edge Impulse is that there's a huge opportunity in taking the knowledge sitting in these big models and training smaller, much more specialized models that run faster, with low latency, fully self-contained on device, and are therefore much cheaper to operate. That's what we try to enable here at Edge Impulse. Our GPT-4o labeling block is available for enterprise customers; you can get a free trial by going to studio.edgeimpulse.com and signing up for the enterprise free trial, and go test it out. And if this kind of stuff interests you and you think you have a perfect opportunity somewhere, let us know; we'll be happy to think with you about how to take the knowledge sitting in these large LLMs and put it on really small devices in the field. Thank you.
Info
Channel: Edge Impulse
Views: 127,854
Id: Jou0aRgGiis
Length: 14min 3sec (843 seconds)
Published: Wed May 29 2024