Zero to Hero LLMs with M3 Max BEAST

Video Statistics and Information

Captions
I just have the M3 Max on my desk because this is what it's really going to be good at: machine learning. I'm going to do some training in later videos, but for today I want to do some local LLM installations, kind of like having ChatGPT on your own computer without paying for it. All the stuff that we're going to be doing today is free, except the M3 Max MacBook Pro; that's expensive. I'm going to take you through the installation process, and in the second part of the video I'll show you a really easy installation with LM Studio for those folks that don't want to mess around with scripts, cloning Git repositories, and building. That's going to be pretty much plug and play: click here, click there, and you're done. If you want to jump to that, I'll leave chapter markers down below.

We're going to kick things off with llama.cpp. Despite the "llama" in the name, it can actually run lots of different models that are out there, not just Facebook's Llama. Let's jump over to the terminal. I do have conda installed; if you don't know what that is or how to use it, I'll link to a video where I show you step by step, but I'm going to just use it here. I'm going to do conda create to make a brand new environment, I'm going to call it llama-cpp, and we're going to use Python 3.11. Then conda activate to get into it.

Now, if you have Git installed it might clone repositories fine, but it might not support large file downloads, and we're going to need some gigantic files for the models. They could be 7 GB, they could be 13 GB, some are 30 GB. So we're going to use Homebrew: brew install git-lfs. This gives you LFS, large file support, in Git. When that finishes, read the instructions: it says to update the global Git config by running a command, so copy, paste, boom, and it reports that Git LFS is initialized.

Now let's go back to that repository and clone it in the terminal: git clone, paste that URL... mess something up, try it again (what am I doing here?), go back to the terminal, git clone, paste that URL, and we've got the repository. There it is: llama.cpp. If we take a look inside, we've got a bunch of files, including a Python requirements file, so we're going to install the Python dependencies with pip install -r pointed at the requirements file that came with the repository. We're good to go.

Now, this is a C++ project, and a lot of other UIs are actually based on it, like the one we're going to look at later, LM Studio. What we need to do is call make in order for it to build. You should already have the developer tools installed on this machine if you've installed Xcode and followed my instructions from previous videos on how to set up your environment. If you just call make inside this directory, it should go ahead and build everything for you. This builds, from the C++ code inside this directory, all the tooling you're going to need to manipulate your models, quantize them, run benchmarks, and run the actual models. The build process also ensures that it will utilize the GPU inside the M1, M2, or M3 Apple silicon machines.
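Here is a rough sketch of those setup steps in one place, assuming the environment name and Python version from the video and the standard upstream llama.cpp repository; adjust names and paths to taste.

    # create and activate a fresh conda environment for this project
    conda create -n llama-cpp python=3.11
    conda activate llama-cpp

    # add large-file support to Git (model files are multiple gigabytes)
    brew install git-lfs
    git lfs install   # updates the global Git config

    # clone and build llama.cpp; make picks up Metal support on Apple silicon
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    pip install -r requirements.txt
    make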
The next step is that we need the very large model itself. Head over to Hugging Face and you'll run into a lot of these models. Here's one of them, Mistral 7B. 7B usually means it's a 7-billion-parameter model; there's 13B, you'll see 15, you'll see 30, and so on. This one is considered one of the newer ones that's pretty good, and you can see it had 22,000 downloads in the last month. If you back up to the teknium directory, you'll see other models available that you can download and install the same way. There are also popular models from TheBloke; you'll no doubt run into this person and their models. Down on this page you'll see some benchmark results of how well this model does against some of the more popular alternatives.

Let's go back to our terminal. I'm going to do git clone and paste that in, except I'm going to rename the folder to open-hermes-7b-v2.5. This is going to take a little bit of time to download because it's huge. Now you need to place the folder you just downloaded into the models directory; for this to work, we need to move open-hermes into the models folder.

The next tool we need is called convert. It's a Python script, and we get to pick whether we want F16 or F32, which are just different levels of precision. I'm going to run python convert and ask it to create a new model in the GGUF format with an F16 file type. If you want to see what these look like, go to the open-hermes folder in the models directory: there are the originals we downloaded, which are about 13 to 14 GB, and the new one we just created, which is another 14 and a half GB. We're talking about really large file sizes here.

We're not done yet; we're going to do some more quantization. I'm going to use quantize, as the repository recommends. This quantize tool is something we just built with make earlier. We point it at the F16 file we just converted and output a new GGUF Q8_0 file. There it goes, doing the conversion and eating up my hard drive space. There we go: we just converted that 14-and-a-half-GB file into a 7-and-a-half-GB file, so we've scrunched it up much smaller. And I'm going to do it again, this time to another format, Q4_K. We'll take a look at these different formats a little later in the video because they do give different results. All right, that one is even smaller: 4.3 GB.
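For reference, the download, convert, and quantize steps look roughly like this. The Hugging Face repository path and the output file names are illustrative stand-ins for whichever model you actually cloned; convert.py ships with the llama.cpp checkout, and the quantize binary is one of the tools make built.

    # pull the model weights (Git LFS handles the multi-gigabyte files);
    # the repository path here is an example, use the model you picked
    git clone https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B models/open-hermes-7b-v2.5

    # convert the original weights into a single GGUF file at F16 precision
    python convert.py models/open-hermes-7b-v2.5 --outtype f16 \
        --outfile models/open-hermes-7b-v2.5/ggml-model-f16.gguf

    # quantize the F16 file down to smaller formats
    ./quantize models/open-hermes-7b-v2.5/ggml-model-f16.gguf \
        models/open-hermes-7b-v2.5/ggml-model-q8_0.gguf Q8_0
    ./quantize models/open-hermes-7b-v2.5/ggml-model-f16.gguf \
        models/open-hermes-7b-v2.5/ggml-model-q4_k_m.gguf Q4_K_M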
I'm going to run a benchmark now to see how quickly we can generate tokens, using the batched-bench tool. By the way, I'll link to all these commands in the description so you can copy and paste them; that's easier than reading them off the screen in the video. Batched-bench is running right now, actually doing the inference. The CPU is barely being used, but the GPU... oh, don't make me a liar, where's the GPU usage... there it is, maxing out the GPU. It's not using the CPU for inference, only the GPU and memory. That's a big bump in the memory there: this machine has 64 GB of RAM, we're using 18, and we've got 46 GB cached. You can take a look up here while this is running: we've got a 13.49 GB model, 7 billion parameters, and the memory required is about 13 GB. So if you try running this on an 8 GB machine, you can imagine what's going to happen; even a 16 GB machine might struggle a bit.

And we're done. Now, what does this mean? Here's the prompt size; we can change it in our initial command, since this parameter happens to be the prompt size. You can read the docs (I'll link to them down below) for what all these parameters are. TG is how many tokens to generate, and B is how many requests are coming in at the same time: if you're hosting this model, it can process multiple connections in parallel, though right now this is only doing one. Some of the more interesting columns are S_PP, which is how quickly the tokens from the prompt are processed, and S_TG, which is how quickly new tokens can be generated. We got 237 tokens per second as the total speed here.

We can actually decrease the memory requirement by using the quantized model. Here we're using the Q8_0 quantized model we converted earlier, with the same parameters. Let's run that one, and you'll see that the memory required is half, about 7,338 MB, so on a 16 GB machine this should work fine. You'll also notice that the tokens-per-second speed has gone up. It's a matter of balancing these things out, making sure you have enough memory and that the token amounts are right. And this is where, if you don't want to worry about memory at all, you get the 192 GB M2 Ultra machine, right, or the 128 GB M3 Max.

Oh, there it is, the fans have kicked up and the temperature has really climbed. All the cores are heating up into the high 90s, and look at that GPU kicking up a storm. Memory is still okay, well, yeah, memory is okay because we're only using the 7 GB requirement here. By the way, let me take this moment while it's running to ask you to like this video if you like this kind of content, and subscribe to the channel, especially if you want to see that training video that's coming up soon. Do you hear that? You can probably hear it by now, right? We're at 5,500 RPM on that fan. That's nuts. As a side note, this right here is the loudest MacBook ever; it's louder than the Intel one when the fans spin up high enough, but the speed here is significantly faster.

I'm going to go ahead and terminate this, because I want to show you one more thing: this command starts a basic server, allowing us to serve up this model with an HTTP back end, and this flag specifies that we can handle four parallel tasks at the same time. And there we go, the server has started and this is the URL. I'll go over to my browser, plug in that URL, and look at that: we've got a little server where we can change some parameters on the fly, but I'm just going to go down here and use it like ChatGPT. Create a Python function to extract email addresses from a string. Let's see... hey, that's actually really fast. That's really fast. Wow. It looks right; I haven't tested it, I'll do that in a separate video, but that looks pretty good, and it was really fast. It even explains it. I don't know about more complicated examples, but for this it worked pretty well.
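The server step above looks something like the sketch below. The model path is an example, -np 4 matches the four parallel tasks mentioned in the video, and 8080 is the server's default port.

    # serve the quantized model over HTTP; -np 4 allows four requests in parallel
    ./server -m models/open-hermes-7b-v2.5/ggml-model-q8_0.gguf -np 4

    # open http://localhost:8080 in a browser, or hit the completion API directly:
    curl http://localhost:8080/completion -H "Content-Type: application/json" \
        -d '{"prompt": "Create a Python function to extract email addresses from a string.", "n_predict": 256}'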
Now, there is another tool that comes with llama.cpp that automates this entire process we've just done and lets you automatically download some models. Let me show you. You call this bash script, which basically downloads, quantizes, and runs the server all in one step, and it even has a little wizard to help you out. Press Enter to continue, and it asks which model you want and lists some options; these are different from the ones we used. We can do something like Code Llama, for example, so let's go with three. It grabs that one, and then asks about the model files, which are the specific quantized versions; let's go for number seven. So it goes out there, grabs the files, spins up a server, and is available just like we saw, but everything is automated. Consider this a little bonus for you. That was just so much faster than when I did it manually, and it's done already. Wow, let's have a look. Yep, there you go, same interface. "Hi there, how can I help you?" Can you code? "Of course I can." That is insanely fast. Write a Python program to extract email addresses from a string; I have a feeling this is going to be fast. Let's see... "Sure, what should be the input?" Huh. The input should be a string. "Okay, I'll give it a shot, here's my code so far." Oh, maybe... I don't know, I'm not so sure about this one. Continue. Oh, an array of strings. There's a bit more hand-holding here, not like ChatGPT. "Awesome, thanks for your feedback." Okay, so this one is a little wacky; let's quickly move on to the next thing.

The next thing is something called LM Studio, which is built on top of the tool I just showed you, and it makes everything so much easier. All you do is download it: here's Download LM Studio for Mac, double-click, install it, drag it to the Applications folder, and, well, look at that, it's done already. Let's run it. This is pretty simple because they give you model options right on the home page; you can grab any of these, or you can do a search. Let's type "code" and select this CodeLlama 7B Instruct. It's got 311 downloads, which makes it the most popular one of those listed, and on the right we have the different files you can download. But which one do we choose? How do you know? Click the Model Card button up here, which takes you to Hugging Face, where you'll see information about it. First of all, you'll see the different tools you can run it with; llama.cpp is one of them, LM Studio is one, and there are a few others. And here is the explanation of the quantization methods (told you we'd look at this). I'm not going to explain it, because I'm not 100% sure what all of these do, but if you're interested you can read how the different quantization types work: 5-bit, 4-bit, 6-bit, different block sizes, and so on. These quantization methods do affect performance. For example, these are the actual model files, the ones that end in .gguf; they're all CodeLlama 7B Instruct, but each is quantized a little differently, yielding different results. This one down here, Q5_K_M, is large with very low quality loss, and it's recommended, so we're going to use that one, as opposed to, say, this one, which is very small but has a lot of quality loss.

Let's go back to LM Studio and find Q5_K_M, right there, and download it. Down at the bottom you'll see the model downloading. Want to see how fast my internet is? Check it out... I'm just kidding, I sped the video up. All right, it's done now.
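As an aside, if you would rather fetch one of those quantized files from the terminal instead of through LM Studio, the Hugging Face CLI can download a single GGUF file. The repository and file name below follow TheBloke's usual naming convention and are an assumption on my part, not something shown on screen.

    # install the Hugging Face CLI, then download just the Q5_K_M file
    pip install -U huggingface_hub
    huggingface-cli download TheBloke/CodeLlama-7B-Instruct-GGUF \
        codellama-7b-instruct.Q5_K_M.gguf --local-dir ./models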
Now we can chat by clicking on the chat bubble, and at the top we need to select the model; if you've downloaded multiple models, you can switch between them, which is really cool and very easy to do. I only have one here, so I'll select it, and the model is loaded. Now we have a ChatGPT type of interface, including history on the left. Write a Python function that extracts email addresses from a string... boom. Now, this is not how you should be running this on an Apple silicon machine, but I wanted to show you what's going on. It's decently fast, not as fast as ChatGPT, for example, and the regex doesn't look exactly right, but it did create the function right here and added a little explanation. Interesting.

So why did I say this wasn't the way to do it? Take a look over here. First we've got RAM usage, which is expected; it's about the same size we saw in the description. CPU is at zero, but if we hit Regenerate, notice the CPU going up, and if we look at the CPU history, we've got spikes there. It's not using the GPU for this, which is why it might not be as fast as possible.

Let's go to the right side now. Under Presets you have CodeLlama Completion, which is one of the presets you can use, but we want more of a chat interface, so I'm going to select the default LM Studio macOS one. Then, under Model Configuration, we want to open Hardware Settings. By the way, you can change all the same parameters here that you did in the other tool; they're just presented more nicely. But the one we really want to change right off the bat, because we're on an M3 Max, or any Apple silicon machine with a GPU, is in Hardware Settings: enable Apple Metal GPU. Right now it's off; if we enable it, it will be on the next time we run the model. So I'll click Reload model to apply configuration. This model seemed to reload pretty quickly; sometimes they don't and I need to restart the program, but in this case it worked fine. Let's regenerate. That was a little different, wasn't it? It was incredibly fast, and it only spit out the code.

I'm going to copy that prompt and start a brand new chat. Not sure what's going on over there; it is using the GPU, but it's just generating blank text. I'm going to restart LM Studio altogether, completely quit it, restart it, go back to chat, and select my model. There's an experimental warning: both Metal and mlock are enabled. Metal we enabled here, but mlock means there's a danger of the machine freezing. For example, if you have a machine with only 8 GB of RAM and you run this model on it, it's probably going to crash the entire machine; not just become unresponsive, actually crash. It warns you again here: if you have 8 GB of RAM, Metal is not recommended. I know what I'm doing because this is a 64 GB machine; don't try this on an 8 GB machine. Let's try that one more time. Okay, that's a little better, I think... no, no it's not, it's gone off the rails, but at least it's using the GPU this time. It might be the model; you might need to find one that's a little better than this. Don't blame the tool. LM Studio just makes things easier and makes model access easier, but clearly this model has gone off the rails and is doing silly things, so get another model. Now you know how to get these tools working, and it's really nice to have something like LM Studio to bypass all the configuration options and let the tool handle the setup for you.

I hope this was helpful. If it was, go ahead and give this video a thumbs up, leave a comment down below, just say hello to me, and I will see you in the next one.
Info
Channel: Alex Ziskind
Views: 91,634
Keywords: apple, apple silicon, apple event, m3, m3 pro, m3 max, macbook pro, software developer, 14-inch MacBook Pro, 16-inch MacBook Pro, Mac laptop, Apple silicon, M3 chip, M3 Pro, M3 Max, programmer, software development, programming, developer, developer tests, m3 chip, machine learning, llm, m3max, m3 llm, m3 ml, m3 max ml, ml on m3, machine learning m3, m3 machine learning, m3 ai
Id: 0RRsjHprna4
Length: 17min 0sec (1020 seconds)
Published: Fri Dec 01 2023