How to run a local AI chatbot on Windows in 5 min, no cuts, no edits, with Ollama, LMStudio, OpenAI

Captions
Hey friends! Lots to talk about with AI these days, and one of the things I wanted to show folks is that you can run a large language model on your local machine. You need a machine with a decent video card to do this kind of work, but it's useful to know that you can run a chatbot, and even pretend to be the OpenAI endpoint, without leaving airplane mode. You can do this work on an airplane, because you don't always need or want to send all of your data off to OpenAI or Azure OpenAI Studio; sometimes you want that data to stay local. So I just want to remind people that local models exist, and I'll show you a couple of ways you can do that work.

Here's my machine right now; let's take a real quick look at what it has. This is a desktop, but I've done this before on a laptop as well. We've got an NVIDIA 3080, so it's a decent-sized machine, with about 10 gigs of RAM on this particular GPU. You can see whether you can run a local model on your machine by right-clicking on the Start menu, clicking Device Manager, and opening Display adapters to check whether you have a decent video card. A decent video card is usually a 2080 or a 3080. I've seen people do it on a 1080, but you'll be pushing the limit, so you'll want to be careful. If you only have Intel graphics, especially on a laptop, you probably won't be able to do it, but you can always give it a try, and if your computer blows up, then you'll know what happened.
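As an aside that isn't shown in the video: if you have an NVIDIA card, the driver also installs a command-line tool that reports the same information as Device Manager and Task Manager.

    # Ships with the NVIDIA driver; prints the GPU model, driver version,
    # and current GPU memory usage.
    nvidia-smi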
Okay, so the first thing I'm going to do is download LM Studio; this is lmstudio.ai. There are a couple of different choices here: first is LM Studio, and then I'll show you Ollama. This one is great, and it runs on Windows, Mac, and Linux. I'll bring this up here, and you can go to their little store, basically a discovery location, where you can pick different flavors of models, and different models are trained for different stuff. Here's one from Microsoft called Phi-2, and they talk about the things it's good at: this one has not undergone any fine-tuning, it's open source, and they're focusing on toxicity reduction. You can go and download it; you can see this one's about 2 gigs quantized, which basically means squished. And I can scroll and find all kinds of different ones that I might want to pick.

If I want to run one of these, I can click on the chatbot choice right here, and where it says "select a model to load," these are the ones that I've pre-installed. Here's one called Code Llama that's kind of like Copilot, a good model if you want to ask it questions about how to code. Here's one called Mistral that's more of a generalized one. So if I click "select a model" and pick Mistral, I want you to watch the Task Manager as it gets loaded; we're going to see my GPU memory increase. In a second you'll see that pop up, and then we'll go and say hello. Now, this is happening entirely locally. See how it's a little slower than you'd expect? That's being streamed locally.

Now I'm going to click eject (think of ejecting a tape from a VCR, if you're old), and you'll see a little bit of memory getting released. Then let's pick Code Llama, which is a slightly larger model, and accept the new prompt; the system prompt is basically the prompt the model is set up with, like "you are a helpful assistant." So we'll load up that model, and you see a little bump there, but notice how small the bump is. I'm going to offload more of the model. This model has 32 layers, and you can pick the number of layers you think your GPU can handle; loading more of the model into GPU memory makes it faster. I'm going to go nuts and set it to -1, and -1 loads all of the layers (all 32 of them). So I hit "reload model," and watch what happens to my GPU: there it is, a big move. That loaded the entire model into GPU memory.

So I've got Code Llama, and I'll say "write me a C# hello world app." See how much faster that is? That entire thing was loaded into GPU memory, but I want to point out that while all that work is happening, very little is happening in my CPU and system memory; we're pretty much chilling. So imagine yourself on an airplane, having a conversation with a large language model. The model in this case is, as we mentioned, 7 billion parameters, quantized, so it's about 4 gigs, and you can see that the GPU memory right down there has increased. Now I'll eject the model (watch it get ejected immediately, right there), pick another one like Phi, and accept the new prompt.
I'll push my Task Manager over a little bit, and you can see the memory just popped up again right there, and we'll start a new chat. We set our GPU layers to -1; every time I make a change, it needs to unload and reload. "Are you there?" "I'm currently present." "Make me a taco recipe." There you go. Cool.

Now that I've got this model ready to go locally, I'm going to minimize that, because we know it's loaded in memory, and click on this button over here, which is where I think the magic happens from a developer perspective. This creates a local HTTP server that behaves like (pretends to be) the OpenAI API. So I'll hit "start server," and that makes a server on localhost port 1234. I can hit it with a GET, using curl or the browser, so I'll start with the browser: I hit localhost:1234/v1/models, and you can see there's the GGUF file (GGUF is like the ZIP file for AI), sitting in my .cache/lm-studio folder. In fact, we can go into that folder and take a look: there are the cached models I've got. I can select them all, right-click, hit Properties, and you can see I've got 17 gigs' worth of downloaded models on my local machine. Again, all this work can be done entirely offline in airplane mode, and airplane mode works on the ground.

Okay, let's take this curl command right here. curl, of course, lets me make an HTTP call, and we'll tell the thing "always answer in rhymes" (that's the system prompt) and then "introduce yourself," and call that with curl. This might not work because it's on multiple lines... yeah, so let's try it in Ubuntu instead; otherwise I'd have to change the command to not include the backslashes. Here it's saying "failed to connect to localhost: connection refused," so let's go back and make sure the server is on; I'm ensuring the server is started, and sometimes you have to make changes and do reloads. You can see down here that if the server doesn't like what it's being called with, it'll holler at you. Okay, sometimes this is a little frustrating; it happens to me when dealing with multi-line calls to curl, so I'm just going to put all of this on one line for now, which is a little lame, but it is what it is. Here it's saying "couldn't connect to server," so let's confirm the server is in fact running and that I'm not making a mistake. It's saying "localhost refused to connect," so I'll close that and confirm we're in the right place; somehow it's gone, so I'll start and stop it again. There we go; it's back.

Now, you see there's a bit of a pause, and then there's your content, right there. That's because this call isn't streaming. Remember when we did the chat, it streamed chunk by chunk; in this case it had to wait until it was done. Additionally, if you look over here at GPU offload, I didn't load all the layers, and that makes it even slower as well. So let's change that to -1 and hit "reload to apply changes," and again note the GPU memory: see how it pops up? Now the whole model is loaded into the GPU.
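For reference, the calls in this segment look roughly like the sketch below. The routes follow the OpenAI API shape that LM Studio imitates; the exact JSON isn't fully visible on screen in the video, so treat the body as an approximation. The backslash line continuations are also why it pastes cleanly into Ubuntu (or any POSIX shell) but not into cmd.exe, where you'd put it all on one line.

    # List the models the local server knows about.
    curl http://localhost:1234/v1/models

    # Chat completion with a system prompt. The reply arrives as one blob
    # unless you ask for "stream": true.
    curl http://localhost:1234/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "messages": [
          { "role": "system", "content": "Always answer in rhymes." },
          { "role": "user", "content": "Introduce yourself." }
        ],
        "temperature": 0.7,
        "stream": false
      }'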
Let's go ahead and try it again down here. I'm going to have to start and stop the server again (looks like it got a little hiccup there), and then it comes back a lot faster. I also want to call out that it returns different stuff each time: the first time it said this, the second time it said this. It's an important reminder that models like this are not typically deterministic; they'll say different things each time, so you want to count on that. I also want to call out that this is free: these are open-source models running on my local machine. That was the Microsoft Phi-2 model running locally via LM Studio, so I'm going to eject that model and close it.

Now I want to point you at another tool you can go and check out, all free as well: this is Ollama, at ollama.com. You can download it on Windows, it installs itself, and it runs in your tray (the tray is the little area down by the clock). Once you've got it installed, you can run ollama at the command line, and if it's paying attention, it'll load up and start a server: a very similar concept to what we saw with LM Studio. You can run different models and list the models that are available, and it all runs on Windows or Linux. In this case I don't have any models yet, so we can pull one; I can never remember the model names, but llama 2 is one we could pull. It'll go and pull that model down, and it's about 3 gigs, so it takes a bit of time. I'll pull myself out of the way so you can see how fast it's coming down. You are going to have to tolerate that one-time pull of a few gigs, and if you're doing work like this, running around pulling models left and right, then as you saw when I looked in the LM Studio cache, you can end up with 20, 30, 50 gigs of models very quickly. So be conscious that this space will fill up your hard drive pretty fast.

Okay, so we just pulled llama 2. It validates the download, does a little verification, and grabs another layer (come on, let's go, people are watching). It's funny that there's one big four-gig layer and then a bunch of little things on top. There you go, and then we'll say run. It's thinking... "Hey, are you there?" There's your reply. "Make me a C# app." Now, Llama 2 is not a great model for C#; it's more of a chatbot. "How are you doing? Tell me about Scott Hanselman." And this is lies; this is the thing about models like this, this is nonsense. Look at this: March 1964? Not a thing. Not from San Francisco, did not grow up in the Bay Area, didn't do any of this stuff, never worked at Borland (did work at Microsoft). This is unhinged. Now it says I have a consulting business called Hanselminutes; no, I have a podcast. So you get the idea: expect a little nonsense. I'm going to go ahead and type /bye.
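For reference, the Ollama commands used in this segment look like this (exact terminal output varies by version, so treat it as a sketch):

    # One-time download of the quantized model (roughly 3-4 GB).
    ollama pull llama2

    # List the models installed locally.
    ollama list

    # Start an interactive chat; type /bye to end the session.
    ollama run llama2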
I want to call out one last thing you can do, which is another option if you don't want to install any of these tools but you do have Docker. I'm going to run Docker now; it isn't running yet, so I'll start Docker Desktop on this machine. Here comes Docker; it fires up, and now that it's started, I'll say "docker images" as the thing spins up, and there are a bunch of Docker images. Then I'll say "docker run" and run Ollama as a Docker image. There it is, running right there, serving. Now I can go and run a model, so without using the Ollama Windows application, I'm doing the same thing: "Hey there, what's 2 plus 2?" Cool, so that's running entirely in a container. When you're doing that, though, you of course need WSL, you need Docker, you need a good GPU, and you need to be monitoring your GPU memory at all times; you can see that my GPU memory is working a little hard right there. With Ollama, you need to right-click on the tray icon and quit. I just right-clicked and quit, and you can see the memory got released when I did that; simply typing /bye isn't enough to release it.

So we've got a couple of different options here for running local models. We've got Ollama, which we can download and run locally on Mac, Linux, and now Windows (the Windows preview just came out). We've got LM Studio. And of course you saw me run Ollama inside Docker, in this case on Windows, in just two lines: a "docker run" (which I've since gotten rid of) and then a "docker exec."
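The video doesn't linger on the exact Docker flags, so the two lines below are a sketch based on Ollama's published Docker instructions rather than a frame-by-frame copy:

    # Run the Ollama server in a container, keeping pulled models in a
    # named volume; --gpus=all assumes NVIDIA GPU support is set up.
    docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

    # Exec into the running container and chat with a model.
    docker exec -it ollama ollama run llama2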
Local models are cool, and even cooler when they can pretend to be OpenAI endpoints: they save you time, save you money, ensure privacy, and give you some performance. Again, we saw with some of these models that they're not grounded (we used to say "hallucinate," but it's better to think in terms of how grounded in reality a model is), so you need to consider that grounding when you implement these kinds of things in your applications. So that was local models in about 15 minutes. Hopefully this was useful; if it was, please share the video, like and subscribe, smash that bell, and I'll see you again later.

Info
Channel: Scott Hanselman
Views: 48,134
Id: _AxXtXwdZmY
Length: 16min 47sec (1007 seconds)
Published: Wed Feb 21 2024