How to run open source models on Mac | Yi 34B on M3 Max

Captions
Okay, in this video I'm going to show you how to run Yi 34B, one of the latest open source models, on a Mac in an easy way using LM Studio, and it will run on your GPU, so you can see the performance difference between using the CPU and the GPU. This is running on an M3 Max with 128 GB, the maxed-out version of the best Mac Apple is making at the moment. You'll have a link where I share the steps and the files that you need.

First, we're going to install LM Studio. Click the download for your Mac; this application, by the way, is also available for Windows and Linux. The Linux build is a bit tricky and not working properly at the moment, but you can try it, and the Windows installation should be the same. Once you download it, you're going to see an interface like this one.

The second thing we need is the file we want to use, in this case TheBloke's Yi 34B GGUF. GGUF is the new file format that supports Metal for GPU acceleration on Mac; it's like using CUDA, but on a Mac we use Metal to tap the power of the GPU.

Now open LM Studio. You'll see an interface where you can find the model just by typing in the search box: type "Yi" and you're going to see a few of them. We're going to use TheBloke's Yi 34B GGUF, or you can use, for example, the variant with the 200,000-token context window. I'm going to use the first one, and then you look at the options here.

An important thing to consider, and the reason I switched to the M3 Max, is that this model uses more than 30 GB of RAM, so keep in mind that if your Mac has 24 GB or 16 GB, this is probably not going to work.

For the quantization versions you have different options here, and what I always recommend is to go to the model card. It opens a website, and if you scroll down (TheBloke's cards are usually pretty good), it tells you which quantization is better for running inference. See here: "large, very low quality loss, recommended." Basically, most of the time the K_S quants are a good choice, so we're going to use the Q5_K_S model.

Going back to LM Studio, click on the Q5_K_S file and then click Download. You'll see the model downloading, and once it's finished, go to the chat tab. I'm going to delete this chat and load the model; it shows all the models you've already downloaded. I have two of them: the Llama 2 70B and the Yi 34B 200K model. Click on it, and on the right side you get a few parameters.

Now that we have the model loaded, the best preset I've found for loading it is the default LM Studio macOS preset (there's an equivalent default LM Studio Windows preset). Once you do that, the only other thing you want to change is further down: enable the Apple Metal GPU option, which means inference will use the GPU instead of the CPU, and you can watch the load here on the left side.
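As a quick aside, you can sanity-check whether a given quantization will fit in your machine's memory before downloading anything. Here's a back-of-envelope sketch in Python; the bits-per-weight figures are rough approximations I'm assuming for common GGUF quant levels, not numbers from the video:

```python
# Back-of-envelope RAM estimate for a quantized GGUF model.
# The bits-per-weight values below are rough approximations, not exact figures.
QUANT_BITS = {"Q4_K_S": 4.6, "Q5_K_S": 5.5, "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0}

def estimate_ram_gb(params_billions: float, quant: str, overhead_gb: float = 2.0) -> float:
    """Weight size plus a flat allowance for KV cache and runtime overhead."""
    # billions of params * bits per weight / 8 bits per byte = GB of weights
    weights_gb = params_billions * QUANT_BITS[quant] / 8
    return weights_gb + overhead_gb

print(f"Yi 34B @ Q5_K_S: ~{estimate_ram_gb(34, 'Q5_K_S'):.0f} GB")  # ~25 GB
```

At roughly 5.5 bits per weight, 34 billion parameters works out to about 23 GB of weights before context and overhead, which is in the same ballpark as the memory usage LM Studio reports next.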
With the model loaded, the RAM usage, or rather GPU memory usage, at this point is 24 GB. So if you have an RTX 4090 with 24 GB you may be able to run it, but you'll probably need more than that, and the same goes for Apple Silicon: if your M-series chip has less than 24 GB of memory, it's probably not going to run.

We're going to ask the model to write the Pygame snake game. It's a pretty easy task, yet most models have failed at it; ChatGPT, for example, does it pretty well with GPT-4. So let's see what it can do: "snake game in Pygame, using Pygame and Python." Here we go. I also want you to watch the generation: how fast it's generating, how many tokens per second it's doing. And as you can see, for some reason it was writing the code and then suddenly broke off the instructions and jumped back to the beginning, so that's something that's not working properly. The code it produced isn't going to run any game as written. People say you can modify it, change the instructions, and so on, but I don't want to do that; I want to see this model perform in an easy way. It doesn't matter if you release a model if you then have to fight with interfaces or figure out exactly how to prompt it to get the answer you're looking for. If you download Llama 2, in any version of Llama, you just run it here, click the settings, and you automatically get a great response, so I think this part is useless. As for the output speed, it's not as fast as Llama; even a 34-billion-parameter Llama is way faster than this one, at least on the M3 Max.

Now we're going to disable the GPU and this time run on the 16-core CPU, so you can see the difference in output when inferencing the model on the CPU versus the GPU; what you saw before was the GPU. Reload the model, make a new chat, and give it the same task: "write Python code for a snake game," for example. You can see the CPU being used now, around 376% CPU utilization, and it's kind of slow; everything is loading on the CPU. Watching the response come in, it's going pretty slowly compared to the GPU. So that's something to keep in mind: forget about the CPU, just go with the GPU, the same way we do in the PC world. Even if you have the latest 96-core AMD CPU, you're not going to use that for inference; you're probably going to use an RTX 4090, or an A6000 if you need more memory. It's always better to run on the GPU than on the CPU.

So you saw how it went, and it's still not answering the question as it should. The context length right now is 1,500, plus the frequency and scale settings you see here, and maybe changing these settings works differently, but like I said, we need a plug-and-play solution where we can download a model, run it, and it just works. I'm not going to test anything else, but if you want to play around with it, you can download the model in LM Studio. I already tried with Ollama and had the same result: I could load the model into Ollama, but getting output wasn't as easy as with Llama 2, for example.
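For reference, here's roughly what a minimal working answer to that snake-game prompt looks like. This is a hand-written baseline sketch for comparison, not the model's output:

```python
# Minimal Pygame snake: a hand-written baseline, not model output.
import random
import sys

import pygame

CELL, GRID_W, GRID_H = 20, 30, 20

pygame.init()
screen = pygame.display.set_mode((GRID_W * CELL, GRID_H * CELL))
pygame.display.set_caption("Snake")
clock = pygame.time.Clock()

snake = [(GRID_W // 2, GRID_H // 2)]
direction = (1, 0)
food = (random.randrange(GRID_W), random.randrange(GRID_H))

while True:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            pygame.quit()
            sys.exit()
        if event.type == pygame.KEYDOWN:
            keys = {pygame.K_UP: (0, -1), pygame.K_DOWN: (0, 1),
                    pygame.K_LEFT: (-1, 0), pygame.K_RIGHT: (1, 0)}
            new_dir = keys.get(event.key, direction)
            # Ignore reversals straight back into the body.
            if (new_dir[0] + direction[0], new_dir[1] + direction[1]) != (0, 0):
                direction = new_dir

    head = (snake[0][0] + direction[0], snake[0][1] + direction[1])
    # Wall or self collision: just restart (minimal sketch, no game-over screen).
    if head in snake or not (0 <= head[0] < GRID_W and 0 <= head[1] < GRID_H):
        snake = [(GRID_W // 2, GRID_H // 2)]
        direction = (1, 0)
        continue

    snake.insert(0, head)
    if head == food:
        # Note: food may respawn on the snake; fine for a minimal sketch.
        food = (random.randrange(GRID_W), random.randrange(GRID_H))
    else:
        snake.pop()

    screen.fill((0, 0, 0))
    for x, y in snake:
        pygame.draw.rect(screen, (0, 200, 0), (x * CELL, y * CELL, CELL, CELL))
    pygame.draw.rect(screen, (200, 0, 0), (food[0] * CELL, food[1] * CELL, CELL, CELL))
    pygame.display.flip()
    clock.tick(10)
```

Run it with `python snake.py` after `pip install pygame`; a correct model answer doesn't need to be any fancier than this, it just needs to run.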
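And if you want a number for the GPU-versus-CPU comparison rather than eyeballing the stream, you can time a generation yourself. A rough sketch, assuming LM Studio's built-in local server is running (Local Server tab) with its OpenAI-compatible API on the default http://localhost:1234, and assuming it reports OpenAI-style token usage in the response; check the Local Server tab if your setup differs:

```python
# Rough tokens-per-second timer against a local OpenAI-compatible server.
# Assumes LM Studio's Local Server on http://localhost:1234 and that the
# response includes an OpenAI-style "usage" block (both are assumptions).
import time

import requests

API_URL = "http://localhost:1234/v1/chat/completions"

payload = {
    "messages": [{"role": "user", "content": "Write a snake game in Pygame."}],
    "max_tokens": 256,
    "temperature": 0.7,
}

start = time.time()
r = requests.post(API_URL, json=payload, timeout=600)
elapsed = time.time() - start

tokens = r.json()["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f} s = {tokens / elapsed:.1f} tok/s")
# Toggle the Metal GPU setting in LM Studio, reload the model, and rerun
# to compare GPU vs CPU throughput.
```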
Okay, I hope you enjoyed this video. Launching any model with LM Studio works the same way: you download the app for Mac or Windows, download the model, and start playing with it, and it gives you a lot of information about how many tokens per second it can do, the parameters you're using, and how much GPU memory it's taking, which is pretty interesting to see.

But like I said, I wanted this video to be a simple download-and-launch, and this model failed at that. If you try Llama 2, for example, you get great performance out of the box with this application: download the model, chat with it, and it just works, so everyone can use it. I think that's the problem with open source models: they just upload the model and don't explain how to run it, or they give instructions that most people don't understand. You have to make it very simple. For example, upload the model, put an interface on it, like a Gradio interface or whatever, and show a demo of how to actually use it, at least to talk to it. On top of that you can add multiple features with other software, like LangChain and so on, but at minimum they should upload the model with a working demo you can ask questions of. Otherwise, what's the point of using these things? You'd just go use ChatGPT, because $20 a month is not that much when you look at what ChatGPT can do, especially now with GPT agents. For open source models to be really usable on our own computers, they need a plug-and-play solution where you download the model and use it. That's my thought: you have to make things very, very simple to use, otherwise people won't use them. That's the main reason people use ChatGPT: it's a web interface and it's easy to use.

All right, if you want to see more videos like this, or if you want me to try different models on the Mac, let me know. I have the maxed-out M3 Max, and I'm planning to get the Falcon 180B model running for the next video. I hope you enjoyed this one, and I'll see you in the next one. Bye-bye!
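As a footnote to the Gradio suggestion above: a minimal demo interface really is only a few lines. A sketch, again assuming LM Studio's local server is running with its OpenAI-compatible API on the default http://localhost:1234 (the endpoint and response shape here follow the OpenAI chat format, which is an assumption about your setup):

```python
# Minimal Gradio chat demo in front of a locally served model.
# Assumes LM Studio's OpenAI-compatible local server on the default address.
import gradio as gr
import requests

API_URL = "http://localhost:1234/v1/chat/completions"

def chat(message, history):
    # Single-turn for brevity: history is ignored in this sketch.
    payload = {
        "messages": [{"role": "user", "content": message}],
        "temperature": 0.7,
    }
    r = requests.post(API_URL, json=payload, timeout=600)
    return r.json()["choices"][0]["message"]["content"]

gr.ChatInterface(chat).launch()
```

Run it with `python demo.py` after `pip install gradio requests`, and Gradio prints a local URL you can open in the browser to talk to the model.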
Info
Channel: TECHNO PREMIUM
Views: 9,701
Id: GAo-dopkgjI
Length: 9min 18sec (558 seconds)
Published: Sun Nov 26 2023