Blazingly Fast LLM Inference | WEBGPU | On Device LLMs | MediaPipe LLM Inference | Google Developer

Video Statistics and Information

Captions
Hi guys, in this video I'm going to show you how we can run a blazingly fast inference model on our local device. It will look something like this, and let's do one more example where we summarize a text, taking the text from Wikipedia. So now you have seen how blazingly fast this inference model returns an output for any prompt; let's quickly check how we can implement or deploy it on our local system.

If you go to github.com/googlesamples, you can see multiple repos created by Google's developer team. We are interested in MediaPipe, so click on mediapipe, then on examples, and you will see a lot of samples: audio classifier, customization, face detector, and so on, but we want llm_inference. Inside it there are four folders; for the time being go to the JavaScript one, which contains three files: index.html, index.js, and a README. Going through the README, it says this web sample demonstrates how to use the LLM Inference API to run common text-to-text generation tasks like information retrieval, email drafting, and document summarization on the web. We'll talk about use cases later, but by now you probably understand what kinds of use cases we can put this large language model to. The prerequisite is a browser with WebGPU support, such as Chrome on macOS or Windows.
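If you want to confirm that your browser actually exposes WebGPU before going further, a quick check looks something like this (a minimal sketch; the messages and where you run it are my own choices, not part of the sample):

```javascript
// Minimal WebGPU availability check (run in the browser console or on page load).
if (!('gpu' in navigator)) {
  alert('WebGPU is not available in this browser; try a recent Chrome on macOS or Windows.');
} else {
  navigator.gpu.requestAdapter().then((adapter) => {
    if (adapter) {
      console.log('WebGPU adapter available:', adapter);
    } else {
      console.warn('WebGPU is exposed but no suitable GPU adapter was found.');
    }
  });
}
```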
I have already shown the demo, but let's go through the README and follow its instructions for running it. First, make a folder for the task named llm_task; if I open Visual Studio Code you can see I have created that folder. Into that folder, copy index.html and index.js, so we have both files here. Then download Gemma 2B, a TensorFlow Lite model built for GPU; there are two versions of it, int4 and int8. In total, four large language models or variants are supported: Gemma 2B, which the team has already quantized and provided, plus Phi-2, Falcon, and StableLM. For those, the repo also includes a conversion folder with an LLM conversion notebook, where you select the model and the backend, CPU or GPU, and then download and convert the model. Finally, in your index.js file, update the model file name with your model file's name; here it is set to gemma-2b-it-gpu-int4, so right now we are using the int4 Gemma 2B.

If you click on the Gemma 2B link you will land on Kaggle, and in the TensorFlow Lite section you can see the available variations; for the time being you can only use the GPU ones, either int4 or int8, so click the download button to get the model onto your device. There is also a disclaimer you can go through, and an example-use section: you can run this TensorFlow Lite model completely on device with the MediaPipe LLM Inference API, which acts as a wrapper for Gemma, enabling you to store and run the LLM on device for text-to-text generation tasks.

Back in the repo, once you have done those three things, you can run the application with python -m http.server 8000, and that is exactly what I did here. As you can see, all of this is blazingly fast because we are using WebGPU. The prerequisites asked for a browser with WebGPU support, and if you check with ChatGPT or Google it, you will learn that WebGPU is a modern web standard and API designed to provide low-level, high-performance access to graphics and computing capabilities on the web; it is the successor to WebGL. That explains why the MediaPipe package uses WebGPU: to increase the computing capability available to these large language models and deliver blazingly fast inference.

Now let's quickly go through the code. If you check index.html in the repo, you will find there is not much to it. What I did is beautify the UI: as always, I copied the code, pasted it into ChatGPT, and asked it to improve the user interface, and it gave me an improved version with this styling added. You can experiment with the UI part yourself; I'm not going into the details, but my intention in showing it is that you can come up with a better UI instead of the very basic one in the code. Apart from the styling, the HTML has a few elements: an input inside the body, a textarea, another input for the Get Response button, then the result section with another textarea, and finally a script tag that calls index.js, where all the logic is defined. If we go back to the UI, you can match each of these components to what you see on screen.

In index.js we have FilesetResolver and LlmInference, which we import from the MediaPipe tasks-genai package on the CDN, and we grab the elements from the HTML file: the input, the output, and the submit button. We define the model file name; in our case the model sits inside the llm_task folder itself. Then there is displayPartialResults, which is essentially a streaming function: as output arrives, we show it on the fly in the UI. The main function creates the LLM inference task; on click it generates a response from the large language model, and this is also where the model is loaded and where all the hyperparameters live, so you can play around with them. maxTokens was 512, but for some experiments I increased it to 2048, and you can also set temperature, randomSeed, and topK. Then we have the generate-response call, and if anything goes wrong there is a catch section that alerts "Failed to initialize the task." Finally there is runDemo, where the whole flow is defined and run. Putting those pieces together, the core of index.js looks roughly like the sketch below.
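For reference, here is a condensed sketch of that flow, adapted from the sample's index.js. The element IDs, the model path, and the specific topK, temperature, and randomSeed values are placeholders of mine rather than a verbatim copy, so compare it against the file in the repo:

```javascript
// index.js — condensed sketch of the MediaPipe LLM Inference web sample.
import {FilesetResolver, LlmInference} from 'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai';

const input = document.getElementById('input');    // prompt textarea
const output = document.getElementById('output');  // result textarea
const submit = document.getElementById('submit');  // <input type="button"> "Get Response"

// The model you downloaded into the llm_task folder
// (the int8 build would be 'gemma-2b-it-gpu-int8.bin' instead).
const modelFileName = 'gemma-2b-it-gpu-int4.bin';

// Streaming callback: append each partial result to the UI as it arrives.
function displayPartialResults(partialResults, complete) {
  output.textContent += partialResults;
  if (complete) {
    if (!output.textContent) {
      output.textContent = 'Result is empty';
    }
    submit.disabled = false;
  }
}

async function runDemo() {
  // Fetch the WASM assets for the GenAI tasks from the CDN.
  const genaiFileset = await FilesetResolver.forGenAiTasks(
      'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm');
  let llmInference;

  submit.onclick = () => {
    output.textContent = '';
    submit.disabled = true;
    // Generate a response for the prompt, streaming tokens to the callback.
    llmInference.generateResponse(input.value, displayPartialResults);
  };

  submit.disabled = true;              // stay disabled until the model finishes loading
  submit.value = 'Loading the model...';
  LlmInference
      .createFromOptions(genaiFileset, {
        baseOptions: {modelAssetPath: modelFileName},
        maxTokens: 2048,   // input + output tokens; the sample default is 512
        topK: 40,          // placeholder: consider the 40 most probable tokens per step
        temperature: 0.8,  // placeholder: sampling randomness
        randomSeed: 101,   // placeholder seed used during generation
      })
      .then((llm) => {
        llmInference = llm;
        submit.disabled = false;
        submit.value = 'Get Response';
      })
      .catch(() => {
        alert('Failed to initialize the task.');
      });
}

runDemo();
```

The two key calls are FilesetResolver.forGenAiTasks, which fetches the WASM assets from the CDN, and LlmInference.createFromOptions, which loads the model; after that, generateResponse streams partial results straight into the callback, which is what makes the output appear so quickly in the UI.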
If you go back to the llm_inference folder, you can also see Android and iOS folders, so those of you who are interested in mobile inference can check those out. Each has a README with an overview and a screenshot of an interaction with the large language model, and you can build the demo using Android Studio; I'm not going into the details here, and the same goes for iOS, so you can explore those yourselves.

Guys, I'm going to end this video by talking about some use cases. Those of you who are very interested in on-device LLM inference can definitely use this on your mobile or IoT devices, and people working on RAG should definitely check out these large language models, because they are not so large; they are very lightweight models, and personally I felt their performance is actually very good. It also depends on your use case: if the domain your RAG system targets is very specific, then you definitely have to evaluate it yourself, but at a generic level I would say you can go for it and check how it performs. That's it, guys. I hope you liked this video and learned something. Thank you, have a nice day.
Info
Channel: Ayaansh Roy
Views: 216
Keywords: #LLMs, #AIIntegration, #Tutorial, #MachineLearning, #ArtificialIntelligence, #DeepLearning, #NeuralNetworks, #NaturalLanguageProcessing, #AIDevelopment, #ModelIntegration, #AIProjects, #AIApplications, #AIProgramming, #WebDevelopment, #AIInnovation, #SoftwareDevelopment, #mistral, #mistralofmilan
Id: G8vzGedNnro
Length: 10min 52sec (652 seconds)
Published: Wed Apr 03 2024