How to Run LLaMA Locally on CPU or GPU | Python & Langchain & CTransformers Guide

Captions
Hello guys, welcome back to the channel. In today's video we're going to do a really interesting project: I'm going to show you how to host Llama 2 locally on your machine, and we're going to do it using LangChain. If you've worked with LLMs before, you probably know that most of them are paid services, like OpenAI; here we won't be using OpenAI at all (nothing against them, I actually love OpenAI), because we'll be hosting our own model. I'll show you how to host your own Llama locally on either a GPU or a CPU device, so it doesn't matter whether your computer has a GPU or only a CPU. In my case I have both, so I'll show you how to host it on both platforms, and you'll also learn how to interact with it and send different prompts to the Llama 2 model using LangChain.

Let's look at the outline of what we're going to do, then go through it one step at a time. First, you'll see how to load the model using a CPU, then how to load it using a GPU. As an additional bonus, you'll learn how to load it with CTransformers (if you don't know what CTransformers is, don't worry), and finally, as another bonus, how to use LangChain's LLMChain to interact with it. If you've ever used LLMChain in LangChain, that's what we'll use to talk to the Llama 2 model hosted locally on our device.

The first thing we'll do is download the Llama 2 model we'll be using: the 7-billion-parameter model. There are different sizes, including a 70-billion-parameter model and, I think, a 13-billion-parameter one; in this case we're using the 7B model. Since everything runs locally, you need the model on disk, so go to Hugging Face and download the Llama 2 model with 7 billion parameters. I'll add the link to the description of the video, so just click it; it brings you to the Hugging Face page where you can see the Llama model we'll be using. Press the download button to download it. It's big, approximately 7.2 GB, and it's going to take a bit of time; I've already downloaded it, so if you haven't, start the download now. You can keep watching the video while it downloads, but make sure you have the model, because you need it for everything that follows.
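If you'd rather script the download than click through the website, here's a minimal sketch using the huggingface_hub library. The repo ID and filename below are assumptions for illustration (the video only links a quantized 7B build of roughly this size on Hugging Face), so substitute the actual repo and file you downloaded:

```python
from huggingface_hub import hf_hub_download

# Hypothetical repo and filename for a quantized GGML build of Llama-2-7B;
# replace with the repo/file from the link in the video description.
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGML",
    filename="llama-2-7b-chat.ggmlv3.q8_0.bin",
)
print(model_path)  # local path where the ~7.2 GB file was saved
```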
I'll assume you now have the model downloaded, so let's move on to loading it on a CPU. To do that we first need to install a couple of libraries. We'll be using LangChain, and in the official LangChain documentation there's an integration called llama-cpp-python; that's the library we'll use to interact with our Llama 2 model. Before you install anything, make sure you have the latest version of LangChain on your machine and a recent version of Python, 3.8 and above (with Python 3.7 you'll have issues; 3.9 or 3.10 and above are fine).

Copy the install command from the docs and run it in your terminal. I'm using Anaconda, so make sure you have Anaconda, or just use a normal Python virtual environment with Python 3.8+. I activate my environment, which is called llms, with `conda activate llms`, and then simply install the requirement: `pip install llama-cpp-python`. I installed it yesterday to make sure I'm on the most updated version, so for me it says "requirement already satisfied". Make sure you also have LangChain installed; if you don't, just run `pip install langchain` (the installation page in the docs also shows the command for Anaconda if you're using conda).

So: have LangChain installed, plus this llama-cpp-python library, which we'll use to load the Llama model. The docs show the command for the CPU-only version, which I've already installed, so I won't go over it again. If you're on a Mac with an Apple Silicon chip, there are separate instructions, two commands, to install it with Metal support. On Windows, installation can be a bit more complicated and I'm not going to cover it here, but I think even on Windows you can get away with the plain CPU install.
In other words, just do `pip install llama-cpp-python` and that should be fine, without going through all the extra build steps.

Once everything is installed, we can begin to code. In my editor I'll create a file called llama_cpu.py, since we'll use it to load the model on the CPU; later we'll have another file for the GPU, and at the end I'll show how to integrate all of this with LangChain's LLMChain. I copy the imports from the documentation and paste them in; these are the imports we need to get this running. VS Code gives me a warning because it doesn't know which environment I'm running; that's because I didn't activate my virtual environment before opening VS Code. The easiest fix is to close VS Code and open it again, or select the interpreter manually, which in my case is the llms Anaconda environment. Once I select that and save again, it picks it up, so we're good to go.

Now that all our imports are done, let's load up the LLM. One thing before you start: make sure you have the model downloaded and that you know the path to it. I'll write the model path here: `model_path = "/home/prince/Downloads/..."`; I'm on a Linux machine and the model is in my Downloads folder, so I just grab the exact filename of the file I downloaded and paste it onto the end of the path. In your case, use the path to whatever you just downloaded from Hugging Face.

Once we have that, we're going to do two main things: first create a function to generate a prompt, and then a function that loads the Llama 2 model. Whichever order you do these in is
fine. Actually, we don't need a function to create a prompt yet; we'll use that later on. For now all we need is one step: a function that loads the Llama model. So let's define it, `def load_model():`, and note that we're loading this for CPU devices only. I'll give it a docstring, something like "Loads the Llama model".

Next we need a callback for this to work, because I want to stream the model's output to my terminal as it's generated. I create it using the CallbackManager we imported: `callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])`.

With the callback defined, let's load the model. I write `llama_model = LlamaCpp(...)` and pass in a couple of things. The first is the model path, `model_path=model_path`, the path we defined above pointing to the model we downloaded. You can also pass in things like `temperature` and other parameters; if you want to know what all of these mean, I explain them in my introduction video to LangChain, so check that out. I set the maximum number of tokens to 2,000 (`max_tokens=2000`) and `top_p=1`. Then I pass in the callbacks via `callback_manager=callback_manager`, the callback manager we just created above (let me rename the variable to callback_manager, that makes more sense).

Once the callback manager is specified, I also have to set `verbose=True`. Verbose is required here because we're streaming the output of this model to our terminal; the callback only works with verbose set to true. The callback we're using supports token-wise streaming, so it
can stream the tokens to the terminal as they're generated, and for that we need verbose to be true. Once that's done, I simply return the model; that's all this function does: create the model and return it. If you like type hints, you can annotate the return type as a LlamaCpp object, and similarly annotate the callback manager; that's how you'd add typing to this function if you're interested in that.

With the function done, let's go down below it and call our model. First we need to load it: `llm = load_model()`, which returns the Llama model. Once we have it, I write a model prompt as a string; I'm going to use an f-string (I forgot the special name of those strings in Python for a moment: formatted string literals). Into it I put a question: "What is the largest country on Earth?" You can of course ask it any question you want. Then I store the response: `response = llm(model_prompt)`, which calls the model with our prompt and returns the answer. You can print the response if you want to; if not, that's also fine, since the callback streams everything to the terminal anyway, but I'll leave the print in.

So that's it: we created a function that loads the model, called that function to get the model back, created a prompt, and called the model with that prompt. The complete script looks roughly like the sketch below.
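Here's a minimal sketch of the llama_cpu.py script assembled in this section, following the LlamaCpp example in the LangChain docs of that era (langchain 0.0.x); the model filename is an assumption, so point MODEL_PATH at whatever you downloaded:

```python
from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# Path to the downloaded GGML model file (filename is illustrative).
MODEL_PATH = "/home/prince/Downloads/llama-2-7b-chat.ggmlv3.q8_0.bin"


def load_model() -> LlamaCpp:
    """Loads the Llama model (CPU only)."""
    # Stream generated tokens straight to the terminal.
    callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
    return LlamaCpp(
        model_path=MODEL_PATH,
        temperature=0.75,
        max_tokens=2000,
        top_p=1,
        callback_manager=callback_manager,
        verbose=True,  # required for the streaming callback to fire
    )


llm = load_model()
question = "What is the largest country on Earth?"
model_prompt = f"Question: {question}"
response = llm(model_prompt)  # tokens are also streamed to the terminal
print(response)
```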
With the script ready, we're good to run, but let me give you a couple of warnings first. It's going to take a bit of time, because it's running on the CPU and I'm also recording. If your computer has power profiles, put it on performance mode to get the most out of it; you can see mine is on performance mode, so everything runs at full speed. With that set, I go back to the terminal, clear it, and run `python3 llama_cpu.py`. Like I said, this takes a while, so let's give it some time and see what output we get back.

And there it is: it begins to answer our question, streaming everything to the terminal thanks to the callback manager. You can see: "The largest country on Earth is Russia, which covers an area of approximately 17.1 million square kilometers." So we get back the output we want. Below the answer, just ignore the extra text; that's the model's performance report, showing the load time in milliseconds, the number of tokens used, and so on. Total time for this run was about 30 seconds. And that's basically it; you can see it's using only the CPU, so that's how to run your LLMs on a CPU.

Now let's see what we need to do for the GPU. Back in the documentation: if you're using a GPU, make sure you have the GPU version of llama-cpp-python installed; you can see the instructions for installing it with GPU support, and you need that in place before this will work. If you're on a MacBook, follow the macOS instructions instead, and the Windows instructions are there too, though installation on Windows is a bit difficult (I don't know why Windows is always like this). For CPU you just run the plain install, which we already did. If your computer doesn't have a GPU, feel free to skip this step and go to the next one, where we'll learn how to use CTransformers in LangChain; for those of you with GPUs, stick around.

I create a new file called llama_gpu.py and reuse the code we already have, so I copy everything over. It's the same code with one major difference, in the part where we load the
model. First we set the number of GPU layers we want to offload; this value depends on your model and your GPU's VRAM: `n_gpu_layers = 40`. I got these values from the official documentation and they work perfectly for me, so I'm not going to modify them. Then `n_batch`, which, as the documentation says, should be set with the amount of VRAM in your GPU in mind. Once we have these two values, we pass them into LlamaCpp as the `n_gpu_layers` and `n_batch` arguments (and don't forget the comma between them). With those specified, we're ready to run on a GPU device; everything else remains the same, the only change is adding these two arguments.

So let me go back to my terminal, clear it, and run the GPU version. You can see it was a bit faster loading up, and once it loads, it begins to answer: the largest country on Earth is Russia, which covers an area of approximately 17.1 million square kilometers, about 6.6 million square miles. Good, we get the output back either way, so our model works fine whether we use a GPU or a CPU. I haven't rigorously measured the difference between the two here, but I know that on heavier tasks the GPU's performance really comes into play. The GPU variant of the loader is sketched below.
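A minimal sketch of the changed loader for llama_gpu.py, again following the LangChain docs; n_gpu_layers=40 is the value used in the video, while n_batch=512 is an assumption taken from the docs' example (the video takes its values from the same page), so tune both to your VRAM:

```python
from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

MODEL_PATH = "/home/prince/Downloads/llama-2-7b-chat.ggmlv3.q8_0.bin"  # illustrative

n_gpu_layers = 40  # how many model layers to offload to the GPU
n_batch = 512      # tokens processed in parallel; depends on available VRAM


def load_model() -> LlamaCpp:
    """Loads the Llama model with GPU offloading."""
    callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
    return LlamaCpp(
        model_path=MODEL_PATH,
        n_gpu_layers=n_gpu_layers,
        n_batch=n_batch,
        max_tokens=2000,
        top_p=1,
        callback_manager=callback_manager,
        verbose=True,
    )
```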
That covers the GPU. Next up is CTransformers, but I also promised to show you how to do this with LLMChain in LangChain, so let's do LLMChain first before moving on to CTransformers. I create a new file called llama_llm_chain.py and copy all the code we have into it. It's basically the same code, with a couple of changes: we need to create an LLMChain and pass the Llama model into it as the chain's llm (that's a lot of "LLM"s in one sentence, but let's do it).

First we load our model, which our existing function does. Then, instead of calling the model directly, we build the chain: `llm_chain = LLMChain(...)`. Remember to have this imported; if you don't, add `from langchain import LLMChain`. LLMChain takes a couple of things: the `llm`, which is our Llama model, and a prompt. So let's add a step to the plan, "create function to load prompt", and write a function that builds it. You can call it load_prompt or create_prompt, whichever you're comfortable with; I'll go with `def create_prompt():`, and it will return a PromptTemplate at the end of the day. I give it a docstring too: "Create prompt template."

Inside it, I define the template as a string, `DEFAULT_TEMPLATE: str = ...`, containing the prompt text. I'm going to use a prompt from the official documentation, and I'll note that in a comment, "prompt copied from LangChain docs", to give them credit. I go into the LangChain docs, find a prompt that looks cool, copy it, and paste it in, with a bit of spacing cleanup. All we need is the human's input; we don't need a history variable, because we're not maintaining any memory here. If you guys want me to show how to maintain memory, let me know in the comment section below and I'll do my best to cover it.

Next I create the prompt object itself: `prompt: PromptTemplate = PromptTemplate(...)`, passing in
`input_variables`, which must match the variable used inside the template: if the template contains {question}, then input_variables must contain "question"; whatever name you use in one place has to align with the other. I also pass in `template=DEFAULT_TEMPLATE`, the template we just defined above. Then the function simply returns the prompt, which is a PromptTemplate.

Back in the load function, I create the prompt with `prompt = create_prompt()` and pass it into the chain, and the function now returns the LLMChain instead of the raw model (make sure to change the return type annotation to LLMChain as well). I save that; that's all we need to load our model wrapped in a chain. Calling it changes a bit too: `llm_chain = load_model()`, then I write a simple question, and instead of calling the model directly I call `llm_chain.run(...)` with it. You can name the variable prompt or question, whatever you want; that's up to you. So we load the model, pass in the question we want to ask, and run it with llm_chain.run. This streams the output to the terminal, so you don't actually have to store the result in a variable, but I'm doing it anyway.

That's all we need, so let's run it: I clear my terminal and run `python3 llama_llm_chain.py` (let me check what I called the file). The model loads and begins working on our question. One thing I noticed: whenever you use LLMChain, it takes a ridiculously long time. I don't know what prompts LLMChain is running in the background, but it takes much longer to get an answer back; you can see we still haven't gotten one, whereas by now, without LLMChain, I would have. For reference, the LLMChain version of the script looks roughly like the sketch below.
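A minimal sketch of llama_llm_chain.py as described; the exact template copied from the LangChain docs isn't spelled out in the video, so the one below (a simple question/answer template keyed on a single "question" variable) is a stand-in:

```python
from langchain import LLMChain, PromptTemplate
from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

MODEL_PATH = "/home/prince/Downloads/llama-2-7b-chat.ggmlv3.q8_0.bin"  # illustrative

# Prompt copied in spirit from the LangChain docs; wording is a stand-in.
DEFAULT_TEMPLATE: str = """Question: {question}

Answer:"""


def create_prompt() -> PromptTemplate:
    """Create prompt template."""
    # The input variable name must match the {placeholder} in the template.
    return PromptTemplate(input_variables=["question"], template=DEFAULT_TEMPLATE)


def load_model() -> LLMChain:
    """Loads the Llama model wrapped in an LLMChain."""
    callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
    llm = LlamaCpp(
        model_path=MODEL_PATH,
        max_tokens=2000,
        top_p=1,
        callback_manager=callback_manager,
        verbose=True,
    )
    return LLMChain(llm=llm, prompt=create_prompt())


llm_chain = load_model()
question = "What is the largest country on Earth?"
response = llm_chain.run(question)  # output is streamed to the terminal
```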
I paused and waited for the run to finish, and you can see our model now returns the answer: the largest country on Earth is Russia, covering an area of approximately 17.1 million square kilometers (I've practically memorized that value by now). So when using LangChain's LLMChain, our model takes much longer to respond, but you can also see the advantage of using it: instead of returning only the short highlighted part we got without LLMChain, it gives us a fuller response; you can see it also talks about northern Europe, for example. It's probably doing a lot more querying or prompting than we do ourselves, so we get back a more in-depth answer. I'm not sure about the validity of every detail in that answer, but those are the trade-offs, and you can see them right there. It also takes extra time returning a bunch of things we never asked for; I never asked that follow-up question, but it gives it back anyway, which is a bit strange. I'm still researching why this happens, and I'll let you guys know when I have the answer. So that's one of the limitations of using LLMChain.

With that out of the way, let's move on to CTransformers and use it to load our model as well. First let me stop the running process; it's slowing down my machine and I don't want that. I create another file and call it llama_ctransformers.py, then copy everything over and paste it; the only difference is that we'll be using CTransformers this time. Make sure you have the ctransformers package installed: search for CTransformers in the LangChain docs, and the install command is `pip install ctransformers`. I already have it installed, so I just import it and start using it in our code. I remove the PromptTemplate and LLMChain imports, since we won't be using them here, and change the loader to return a CTransformers object instead. Let's see what we
need to change. The function now builds a CTransformers object, and the return type annotation becomes CTransformers as well. We pass in the model, and you can pass in temperature and other options if you want; that's basically it. The function returns the model, and down at the bottom we prompt the model with our question and expect an output back, just as before. Let's run it: `python3 llama_ctransformers.py`.

We got an error, so I need to remove a couple of arguments that CTransformers doesn't take: I delete verbose, top_p, and the max-tokens argument, which keeps the code much simpler. Looking at what arguments it does take: we pass the model, a `model_type`, which in our case is "llama", and a maximum-new-tokens setting, which I'll drop to 512 to keep the number down. You can also pass in callbacks, and you can read through the documentation to see what else is available. I add the callback manager back (copying it from the other file) and pass it in, then run it one more time, and we still get an error: CTransformers complains about the `model` key. Looking closer, the fix is that the model-path argument is just called `model`, not model_path, and the token limit is actually called `max_new_tokens`, not max_tokens. With those renamed, hopefully this works; I clear the terminal and run it again. And yes, it looks like it's working fine now. The final version of the loader looks roughly like the sketch below.
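A minimal sketch of the fixed llama_ctransformers.py loader; per the LangChain CTransformers docs, generation settings such as max_new_tokens are passed via a config dict, which is the form shown here (an assumption, since the video passes it slightly differently after some trial and error):

```python
from langchain.llms import CTransformers
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

MODEL_PATH = "/home/prince/Downloads/llama-2-7b-chat.ggmlv3.q8_0.bin"  # illustrative


def load_model() -> CTransformers:
    """Loads the Llama model through the ctransformers backend."""
    return CTransformers(
        model=MODEL_PATH,      # note: the argument is `model`, not model_path
        model_type="llama",    # tells ctransformers which architecture to load
        config={"max_new_tokens": 512},
        callbacks=[StreamingStdOutCallbackHandler()],  # stream tokens to the terminal
    )


llm = load_model()
response = llm("What is the largest country on Earth?")
```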
It takes a bit of time to load the model, and once it's loaded we call it with the prompt, "What is the largest country on Earth?", and you can see we get an answer back: Russia is the largest country on Earth, 17.1 million square kilometers (let me double-check that value in my head), about 6.6 million square miles, and it spans across much of northern Europe and Asia. Good, so that's how you can use CTransformers as well.

To recap: you've seen how to load Llama 2 using a CPU, how to run it on a GPU if you have one available, how to use it with LangChain's LLMChain, and how to use it with CTransformers; all of it works and gives back the outputs you saw. We've covered basically everything: you downloaded the Llama 2 model with 7 billion parameters (you can download a bigger one if you want, but make sure you have a bigger computer), loaded it on a CPU and on a GPU, used it with CTransformers, and installed llama-cpp-python along the way.

One more thing: you can see this model is actually very heavy. I have a fairly powerful computer, with a decent graphics card and 24 GB of RAM; it's not the most powerful machine, but it's quite a bit more powerful than most computers out there, and it still takes a while to respond, and my machine is really heating up while I'm recording this video. That's one of the disadvantages of these models: they are really big and require a lot of computational resources to run. This is where the idea of quantization comes into play. Quantization is a powerful technique in machine learning and deep learning, and the main idea is to represent numeric data with lower precision. A model's weights, biases, and activations are represented as numeric values, and the same values can be stored in lower-precision types: we have int64, int16, int8, and other integer widths, each with a different range of values it can represent. If you use 8 bits, you have lower precision than 32-bit floats; if you use 64 bits, you have higher precision than an 8-bit integer. So with quantization you reduce the precision of the numeric values used inside the model (the weights, biases, activations, and so on), and the model becomes much lighter: less memory, more memory-efficient, less heavy. That's useful when you want to deploy to a mobile application or run on a Raspberry Pi or something like that. Models that go through quantization are called quantized models, and the toy sketch below illustrates the idea.
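To make the idea concrete, here's a toy illustration of symmetric 8-bit quantization (not the actual GGML scheme llama.cpp uses, which is more sophisticated): store one int8 per weight plus a single float scale, instead of one float32 per weight:

```python
import numpy as np

# A handful of float32 "weights" standing in for a model tensor.
weights = np.array([0.12, -1.37, 0.88, 2.05, -0.49], dtype=np.float32)

scale = np.abs(weights).max() / 127.0                  # map the largest weight to 127
quantized = np.round(weights / scale).astype(np.int8)  # 1 byte per weight instead of 4
dequantized = quantized.astype(np.float32) * scale     # approximate recovery at runtime

print(quantized)               # e.g. [  7 -85  55 127 -30]
print(dequantized - weights)   # small rounding error per weight: the precision cost
```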
These are some of the advantages of using such a model: it reduces memory and storage, because we're using lower-precision values; it reduces computational cost, so you can run it on a lower-end device (and if you're running in the cloud, you pay less for your usage); it enables deployment on embedded devices like mobile phones and Raspberry Pis; and it's more energy-efficient, since less computation means less energy used, which is good for the climate as well. There are other advantages beyond these too.

That's basically all I wanted us to cover in this video, and I hope you now have an understanding of how to load a Llama 2 model on your own computer. Let me check whether the run actually finished; yes, it stopped, so you can see the answer we got back, which you can verify if you want. So that's why you'd want to use a quantized model, or apply the idea of quantization to get one, and those are its advantages.

Guys, that's all I wanted to cover in this video. If you enjoyed it so far, make sure you like the video, subscribe to the YouTube channel, and share it with anyone who might find it helpful. Also let me know in the comment section below what you think; whether the comments are negative or positive, they all help me make better videos in the future. If this video helped you, make sure you like, subscribe, and share it with anyone, be it a friend or yourself. Thanks for watching and see you in the next one; keep safe.
Info
Channel: Code With Prince
Views: 11,955
Keywords: codewithprince, programmingchannel, python, python devs, funcoding, Llama, How to load Llama on CPU, how to load Llama on GPU, How to load Llama using CTransformers, How to load Llama using CTransformers in langchain, how to load llama 2 model, how to load llama model using langchain, LLMChains with Llama-2 model, How to use Llama-2 model in Python, langchain, load Llama-2, LLMChains with Llama-2, CTransformers, NLP, LLM, LLM Projects in Python, LLM for beginners, langchain for beginners
Id: SvjWDX2NqiM
Length: 39min 51sec (2391 seconds)
Published: Wed Aug 23 2023