The Secret Behind Ollama's Magic: Revealed!

Captions
How does Ollama run? What are the pieces? What happens when you ask a question to the model? I've seen these and variations of these questions on the Ollama Discord, in the comments on my channel's videos, and even on Twitter, where I'm technovangelist, and on GitHub, where I'm also technovangelist. So let's look at all of this in a bit more detail.

As of the recording of this video, Ollama runs on three platforms: Linux, Mac, and Windows. For each of those three platforms there is a single supported installation method: on Linux there's an installation script that you can find on the site, and on Mac and Windows there's an installer. What do they do? Well, let's look at the script for Linux. There's a lot here that is just dealing with CUDA drivers; in fact, in the 266-line shell script, well over 150 lines are just dealing with Nvidia. The rest of the script copies the binary into the right spot, sets up a new user and group so that the service doesn't run as you, and then sets up the service using systemctl and ensures that it stays running. The Windows and Mac apps are a little different, but they end up with the same basic result: there's a binary that runs everything, and there's a service running in the background using that same binary.

So there is only a single binary, but it can run as the server or as the client depending on the arguments given. To work with Ollama there's always a server and there's always a client. That client can be the CLI or another application someone has written using the REST API. When you run ollama run llama2, you're running the interactive CLI client, and the client doesn't actually do anything other than pass the request on to the server. Again, the server is running on your machine as well; we aren't talking about any service running up in the cloud, unless you have set up a remote server to do that. The server running in the background takes that request, loads the model, and lets the client know that it's ready. Now you can interactively ask a question. The same thing happens when you use the API to send a message to Ollama: the Ollama server loads the model, asks the question, and then returns the answer to the API client. There's no need to run ollama run and the model if you're using the API. The CLI client is just another API client, just like the program you're writing.
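To make that concrete, here is a minimal sketch of such an API client in Python. It assumes the server is listening on its default local address (localhost:11434), that the requests library is installed, and that the llama2 model has already been pulled; a request like this is essentially what the CLI sends to the server on your behalf.

    import requests

    # Ask the local Ollama server a question over its REST API.
    # Assumes the default address localhost:11434 and that llama2 is pulled.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama2",
            "prompt": "Why is the sky blue?",
            "stream": False,  # return one JSON object instead of a stream of chunks
        },
    )
    resp.raise_for_status()
    print(resp.json()["response"])  # the model's answer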
Now, I said this all runs locally. There are three exceptions to that. The first is when you have put the server on a remote machine, and the other two are when you pull or push a model; in that case a model is either being downloaded from or uploaded to the ollama.com registry. That opens up another question: does Ollama use my questions to improve the model, and does that get uploaded to ollama.com when I push the model? That is, of course, one of the big concerns with using ChatGPT and other online models; often your interactions with them go back into making the model better. Asking a question and getting an answer out of a local model with Ollama can take a while, but that amount of time is nothing compared to the time required to fine-tune or train a model on your data. You would hear the fans whirring hard for a good long time if Ollama were able to do that. Ollama has no ability to fine-tune a model today, so when you push a model, none of your questions or answers are added to the model. For the most part; I'll talk about an exception in a bit.

Now, some folks hate that there is a service running in the background all the time. It's going to take memory, right? Models are big, and they shouldn't stay in memory, right? Well, the memory consumed by the service is whatever is needed by the model while it's running. Then Ollama will eject the model after five minutes, although that is configurable. At that point the service drops to a minimal memory footprint. There isn't really much reason to stop it at that point, but if you feel strongly that it shouldn't run, here's how to do it. On Mac, go up to the menu bar, click the Ollama icon, and choose Quit Ollama. On Linux, run systemctl stop ollama at the command line. And on Windows, go down to the tray icon and choose Quit Ollama. If you just go and kill the processes, they'll restart and you will get frustrated. If you're on a Linux distro that doesn't use systemctl, then you either installed it manually or used a community-created install; there are a few of those, and I can't really suggest the right way to do it there. To get it started again, run Ollama on Mac or Windows, and on Linux run systemctl start ollama.

Now, some folks hate the fact that Ollama gets rid of the model after five minutes; they want the time to be either shorter or longer. You can set the time using the API parameter keep_alive. If you set keep_alive to -1, Ollama will keep the model in memory forever, or you can specify a number of seconds, minutes, or hours. As of this recording, this has to be done through the API and not via the CLI or environment variables.

Remember, a moment ago I said "for the most part" with regards to adding your questions and answers to the model. There actually is an exception to this, but it's a pretty special case. If I do ollama run llama2 and then ask it a few questions — why is the sky blue, why is the wavelength longer or shorter, can I surf those waves (it's an example, give me a break) — and then run /save m/mattwaves (m being the name of my namespace on ollama.com), I have now saved those messages as part of the model. The messages aren't in the model weights file, though; I'll come back to that in a sec. So let's take a look at how this works. First, the manifest for this model: here we can see there are a few layers, and this one is called messages. Let's open the blob file it mentions. See the questions? They're all there, along with the answers. These are just like the messages you would set if you use the chat API. For instance, if I wanted to use a few-shot prompt showing the model how I expect it to provide a JSON-formatted object, I would add the messages here.
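As a rough illustration of what those messages look like from the API side, here is a sketch of a chat request with a few-shot example baked into the messages array; the questions and the JSON shape are made up for illustration, and the keep_alive field is the same parameter discussed above, included here only to show where it would go.

    import requests

    # A few-shot chat request: the earlier user/assistant turns show the model the
    # JSON shape we expect, and the final user message is the real question.
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "llama2",
            "messages": [
                {"role": "user", "content": "Why is the sky blue? Reply as JSON."},
                {"role": "assistant", "content": '{"answer": "Rayleigh scattering"}'},
                {"role": "user", "content": "Why are sunsets red? Reply as JSON."},
            ],
            "stream": False,
            "keep_alive": "10m",  # or -1 to keep the model loaded until you stop the service
        },
    )
    resp.raise_for_status()
    print(resp.json()["message"]["content"])  # should come back in the same JSON shape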
This is available in the Modelfile as well, using the MESSAGE instruction, and those messages appear here as this messages layer. Based on this, you might think you could just edit the messages in this file and it would replicate the behavior we saw from the Modelfile or the API. That will continue to work locally, but as soon as the model is pushed, the file signatures would no longer match and it wouldn't work. So if you want to achieve this with the CLI, you'll have to create the Modelfile and update the messages there. Again, pretty special circumstances.

When we looked at the manifest we focused on the messages layer, but there are a few other layers as well. You can see one for the system prompt, one for the template, and others. One of them is the model weights file; that's the really big file for each model. See how the name of the file is the SHA-256 digest of the file? When Ollama gets a manifest, it checks whether the corresponding files are already on the system; if it already has a file, it doesn't download it again. So you might have the model llama2, and then another model called "my cool model" that someone created based on llama2 with a unique system prompt. When you pull that model, the model weights file will already be there, so adding the new model will have minimal impact on the space consumed on your drive. Similarly, removing llama2 at that point will have minimal impact on your drive, because "my cool model" is still using that model weights file.

I think that's all the info on how the pieces work together and how to use Ollama. If you have any questions about this or anything else, let me know in the comments. Thanks so much for watching. Goodbye.
Info
Channel: Matt Williams
Views: 17,738
Keywords: llama 2, artificial intelligence, mistral ai, large language model, large language models, how to install ollama, chat gpt, run ollama locally, ollama on macos, llama2 locally, llms locally, how to install llama 2 locally on windows, llama 2 locally linux
Id: Z52no0QQ0hY
Length: 8min 27sec (507 seconds)
Published: Mon Feb 19 2024