Install and run LLMs locally with text-generation-webui on AMD GPUs!

Captions
Hey there, home labbers and engineers, FE Engineer here. Today we're going to look at getting large language models running on a Linux installation. If you're using an AMD GPU, you will need AMD's ROCm installed, but let's take a look at what we need to do and let's get some large language models running. We're actually going to use a web interface known as text generation web UI, from oobabooga. It looks and acts fairly similar to AUTOMATIC1111 for Stable Diffusion, and it is pretty easy to get up and running, even on AMD GPUs. It is pretty straightforward, so let's take a look and let's go.

If you need to install AMD's ROCm, be aware that it is roughly 20 GB in size. Also be aware that the large language models mostly range from about 5 to 10 GB each, so you will eat up quite a bit of hard drive space fairly quickly. With that said, let's take a look at installing AMD's ROCm. If you already have ROCm installed, or if you have followed my guides for running Stable Diffusion, go ahead and skip this part, as it is effectively only about ROCm.

To get ROCm up and running, the first thing we will do is sudo apt update. After that, you will need to run a command to install git, python3-pip, python3-venv, python3-dev, and the libstdc++ development package. After that, run update one more time, then we're going to wget the AMD GPU installer; the wget command simply downloads a file from AMD's repository. Now that we've got it downloaded, we need to install that Debian package, and the command for that is sudo apt install followed by ./ and the amdgpu-install .deb file. Once you have done that, all you actually need to do is run one relatively simple command: sudo amdgpu-install with the use cases graphics and rocm. Once those things have been installed, you will need to run two more commands, which give your current user access to the video and render groups used by ROCm and graphics. Once you have done that, you will need to actually reboot your machine; sudo reboot will work. (These commands are collected in the first sketch below.)

Now that ROCm has been installed and you have rebooted your machine, you should be able to type rocminfo, and it should print out a ton of information. You should be able to find your graphics card inside this information: for me, Agent 2 right here is this gfx1100, and it is my RX 7900 XTX. You should be able to find something like that. If you are familiar with Linux, feel free to grep for whatever you like and slim down the results that way, but this just tells you that ROCm is in fact running, it does see your graphics card, and you should be all good to go.

Now, inside our terminal prompt, we're simply going to make a directory, and maybe we'll call it GPT. Now that we have a GPT directory (ls will show GPT), we're going to cd into GPT, and we are going to start installing the text generation web UI. To clone the repository, we're simply going to type git clone and then paste in the repository that we're going to grab. If we look now, there is a folder called text-generation-webui; cd into that, and it is a pretty straightforward install from here. Inside the text-generation-webui folder, we're going to type python3 -m venv venv, then source venv/bin/activate, which puts us inside our virtual environment. Then we're going to install torch, and we're going to use the ROCm 5.6 build. ROCm 5.7 is sort of out, but it's not fully supported yet and is still on nightly builds, so you can certainly try the ROCm 5.7 version of PyTorch; sometimes it works, sometimes it doesn't. I've had it both working and not working, so your mileage may vary. ROCm 5.6 is stable and does work, though. (The second sketch below collects these setup commands.)
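Here is a sketch of the ROCm install steps described above, assuming Ubuntu. The installer URL is not read out in the video, so grab the current amdgpu-install .deb from AMD's ROCm documentation, and note that the libstdc++-12-dev package name is my assumption (the version suffix varies by Ubuntu release):

    sudo apt update
    # libstdc++-12-dev is an assumption; check your release's package name
    sudo apt install git python3-pip python3-venv python3-dev libstdc++-12-dev
    sudo apt update
    # URL not shown in the video; get the current .deb from AMD's ROCm docs
    wget <amdgpu-install .deb URL>
    sudo apt install ./amdgpu-install_*.deb
    sudo amdgpu-install --usecase=graphics,rocm
    # give your user access to the video and render groups
    sudo usermod -a -G video $USER
    sudo usermod -a -G render $USER
    sudo reboot
    # after the reboot, verify ROCm sees your card (e.g. gfx1100)
    rocminfo | grep gfx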
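And a sketch of the web UI setup itself. The repository URL and the PyTorch ROCm 5.6 wheel index are standard but not read out in the video, and the requirements file name is my assumption (the repo ships several variants, including a no-AVX one), so check the repo for current names:

    mkdir GPT && cd GPT
    git clone https://github.com/oobabooga/text-generation-webui
    cd text-generation-webui
    python3 -m venv venv
    source venv/bin/activate
    # ROCm 5.6 build of PyTorch; 5.7 is nightly-only at the time of the video
    pip install torch --index-url https://download.pytorch.org/whl/rocm5.6
    # file name is an assumption; a no-AVX variant also exists in the repo
    pip install -r requirements_amd.txt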
Now that we have the correct version of torch installed, one note: there is a specific requirements file if you have no AVX. Most people probably do have AVX, but just know that if you run into AVX problems, there is a specifically different requirements file for no-AVX systems.

Now that everything is installed, we're going to create our own launch script; we'll call it launch.sh. Open up an empty file, and your launch file should end up looking like the sketch at the end of this section. The first line says that bash is the way we're going to run this file. source venv/bin/activate makes sure we're in our virtual environment. HIP_VISIBLE_DEVICES=0 makes sure it uses your discrete graphics card: if you have an AMD Ryzen CPU that also has an integrated GPU, that integrated GPU absolutely cannot run ROCm, and it will crash; exporting HIP_VISIBLE_DEVICES effectively makes sure that ROCm and HIP don't look at your integrated GPU and try to use it. Next is HSA_OVERRIDE_GFX_VERSION: I'm running an RDNA 3 card, the 7900 XTX, and for that card specifically the version is 11.0.0; if you're running an RDNA 2 card, you should put 10.3.0 in here. The last line just runs our server. Go ahead and save the file, exit out, and then run sudo chmod +x launch.sh, which makes the file executable. After that, all you should need to do is type ./launch.sh, and you'll see that it is running on a local URL.

Over here on Hugging Face, you will see the Hugging Face models page, and what we are actually looking for is down here under Natural Language Processing: text generation. Go ahead and click on that, and you will see tons of models showing up. The way large language models are classified, you will almost always see a something-B parameter count. 7B, for example, means the model has about 7 billion parameters; 13B is about double the size, and 34B is about triple the size of that. The higher the B number, the more parameters it was trained with, so the better the model is in general, but also the more VRAM it will take just to load it up. I would start with a 7B model, and just so you are aware, these things range in size from 5 to 10 gigabytes each.

To download and load any of the models from inside the text generation web UI, click on the Model tab; as you'll see, right now I don't have any models available. All you need to do is copy the model name from Hugging Face (the little button up there will tell you exactly which model it is), paste it in here into text generation web UI, and hit the download button. But remember, these models are quite large; they are many GB. Once your model has downloaded, you should be able to hit the reload button here, and you should end up seeing your model available.
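As an aside not shown in the video: the repo also ships a command-line download helper, if you'd rather not use the UI. The model name below is a placeholder, not the exact Hugging Face repository used in the video:

    # run from the text-generation-webui folder, inside the venv
    python download-model.py TheBloke/<model-name>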
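And here is the launch script described above, reconstructed line by line from the video (only the comments are mine):

    #!/bin/bash
    # run from inside the text-generation-webui folder
    source venv/bin/activate
    # use only the discrete GPU; integrated Ryzen GPUs cannot run ROCm
    export HIP_VISIBLE_DEVICES=0
    # RDNA 3 (e.g. 7900 XTX); use 10.3.0 for RDNA 2 cards
    export HSA_OVERRIDE_GFX_VERSION=11.0.0
    python server.py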
This is where things get a bit tricky. The very first thing you should do is just run it as Transformers: don't change anything in here, don't do anything at all, and just click the load button. You may end up seeing some sort of problems. Every single one of these loaders you see is effectively a wrapper; there's a whole bunch of different ways to load in the models, and every single one of them, with every single model, will end up being a bit of a trial-and-error setup. Sometimes you'll just get errors, and you'll be able to make some changes, try things differently, and reload. Other times the core will dump, and if the core dumps, you will have to actually run your launch script all over again to reload.

For this model, TheBloke's Wizard Vicuna 7B, I have it just on Transformers: I changed compute dtype to float32, I changed quant type to fp4, and I clicked "disable ExLlama kernel", and when I loaded it that way, it loaded in just fine. You'll see a message: successfully loaded. Go over here to Chat and give it a prompt. "Tell me about yourself" was getting 23 tokens a second; "write me a story", 36 tokens a second; then "write me a longer story".

One of the things you'll notice very quickly when working with these models is that they may just kind of cut off; you'll often see that a response cuts off at around 199 or 200 tokens. That is because of the parameters. I will say that in order to get it to write somewhat longer responses, and in order for everything to work, again there will be an awful lot of trial and error. Sometimes you will have to effectively kill the web UI, which you can do by hitting Ctrl+C; that kills the command entirely and stops your server, and then you reload your launch script. Other times you may be able to just reload the model, or unload it and then load it again. I have not found a lot of rhyme and reason: sometimes it works great, sometimes it works for a while, other times it just kind of dies off. But to get somewhat longer prompts and somewhat longer responses, raising max new tokens, raising the "truncate the prompt up to" length, and enabling auto max new tokens seem to be the best ways to go about speeding things up and getting longer prompts.

Some other things to note: we loaded the model with Transformers, but I have found that the ExLlama_HF loader in general tends to work fairly well, although you will be somewhat limited on the max sequence length; I believe that has to do with the VRAM it takes up. So we loaded with 10,240 as the max sequence length, the parameters have been updated a bit (max new tokens is up and auto max new tokens is on), and let's see if it works. You'll see we were at a blazing 65 tokens a second: the entire response was 382 tokens, and it generated all of that in 5.81 seconds (382 / 5.81 is roughly 66 tokens per second), which is pretty great.

So that's it, folks. That's all it takes; that's how you get large language models up and running on your Linux installation with AMD cards using AMD ROCm, and you'll see that the performance is pretty great overall. You know, 65 tokens a second is fast. I hope this helps and I hope everybody enjoys it. Have fun playing with your new toys. Bye!

Thank you so much for watching my videos, home labbers and engineers. I create and edit all these videos on my own, so any likes and subscribes will massively help out the channel and allow me to continue creating content to help people. If you got value out of this, please consider liking and subscribing to the channel to be notified when new content drops. If there's something I've not covered but you would like to see a video on, please leave a comment down below. And again, a massive thank you to everyone. I hope you have a great day.
Info
Channel: FE-Engineer
Views: 7,541
Keywords: GPT, LLM, Ubuntu, AMD, AMD GPU, Llama, Vicuna
Id: lwndqsUok2s
Length: 16min 48sec (1008 seconds)
Published: Wed Nov 22 2023