Unlocking The Power Of AI: Creating Python Apps With Ollama!

Video Statistics and Information

Captions
Hey there, I'm Matt, and I want to help get you up to speed with developing applications with Ollama using Python. I'm assuming you already know what Ollama is and how to work with it. If not, I have a 10-minute intro to Ollama right here, which should get you up to speed on all the basics. But you want to start building applications that leverage it? Great. Let's start by looking at how to access the API.

Ollama has two main components. First, there's the client. The client is what runs when you type "ollama run llama2"; it's the REPL that you work with. The other main component is the service. This is what "ollama serve" starts up. Normally you don't run this interactively; instead it runs as a service in the background, and that service publishes the API. The CLI is just another API client that uses the standard public API, just like whatever tool you're going to write.

The service publishes a few different REST API endpoints. You can find these at the GitHub repo under docs, and then api.md. Let's start by reviewing these endpoints, and I'll point out some things to keep in mind. It's pretty important to understand the underlying API before we can get busy with the Python library.

There are a few different things you can do with the API. You can generate a completion using the chat or generate endpoints, which means asking a question to a model and getting an answer back. You can create, delete, copy, and list models. You can show model information, such as the system prompt or template. You can push and pull a model to and from ollama.com. And you can generate embeddings.

Let's look at generating a completion. There are two endpoints here: chat and generate. Both can do exactly the same thing, but which one to use depends on your use case. If you're just going to ask a model a question as a one-off request and never hold a conversation with it, then using the generate endpoint makes the most sense. If you need to go back and forth with the model and manage memory or context, then the chat endpoint is more convenient. Again, you can do everything with both, but if it's just a one-off request, then managing the messages in the chat endpoint is a bit of extra work which is only really valuable in conversations.

Okay, so generate can be found at localhost, port 11434, /api/generate. If you want to use the API on a different host, refer to this video that shows you how to host Ollama on a different machine. For generate there is only a single required parameter: model. This is the name of the model you want to load. If you call generate with just the model name and the model is already loaded, then the unload timeout will be reset to another five minutes, with one exception that we're going to talk about later.

The prompt parameter is the question you want to ask the model. This will be inserted into the actual model request based on the template, either the one defined in the model or the template you specify in this request. Images is used when you work with a multimodal model such as LLaVA. You can provide an array of images that are base64 encoded. The model can only deal with a base64 encoded image, so you must perform that conversion yourself; we'll see that this is a little different in the Python library. This is an array, but I haven't seen it do anything special with more than one image yet.
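To make the shape of that request concrete, here's a minimal sketch of calling the generate endpoint directly from Python with the requests package. This is my own illustration rather than anything shown in the video, and it assumes Ollama is running locally on the default port and that llama2 has already been pulled:

```python
import json
import requests

# Minimal sketch: POST a prompt to the generate endpoint and read the
# streamed response line by line. Model name and prompt are placeholders.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?"},
    stream=True,
)

for line in resp.iter_lines():
    if not line:
        continue
    blob = json.loads(line)  # each line of the stream is one JSON blob
    print(blob.get("response", ""), end="", flush=True)
    if blob.get("done"):
        # The final blob has done=true and carries the context plus the
        # generation stats discussed below.
        print()
```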
Notice that the response is a stream of JSON blobs, each specifying the model, created_at, response, and done, so most if not all of the endpoints respond as a streaming API. Each JSON blob includes a token. It looks like most of them are full words, but some of these responses show just parts of words, or tokens. Scroll all the way down and you can see the last blob has done set to true and includes the context that you can provide to the next generate call. It also shows all the stats for this generation. You'll notice that it does not provide tokens per second, but you can just divide the tokens by the seconds to come up with that yourself.

If you want the response to be a single value after the generation is complete, you can set stream to false. We can see here that the response shows all the tokens of the response joined together. If you set stream to false, then you have to wait till all the tokens are generated. Setting stream to false will not speed up the response, and in fact may feel slower since you have to wait longer for that first token. But sometimes it's useful to turn streaming off, for example if you're returning JSON.

That takes us to the format parameter. This only allows for a single value, and that is json. Using format json is more than just specifying the format value: you should also say something like "respond as JSON" in the prompt, and ideally provide an example of the schema in the prompt as well. Otherwise you run the risk of getting a different schema each time you use it.

The big array of numbers in this output is the context. If you want to provide this answer and continue the conversation with the model, pass this context to the next call to the generate endpoint. But it's a bit of a black box that you can't really control, so if I know I'm going to need to hold a conversation, I'll stick with the chat endpoint.

Options, system, and template just override whatever is set in the model file. Raw is there in case you want to handle the entire system prompt and template yourself. There were some early users of the API that had a system managing their own system prompts and needed this; it also removes the context from the output.

Finally, there's keep_alive. keep_alive defines how long the model should stay in memory. The default is five minutes, but you can set it to any time you like, or -1 to keep it in memory forever. If you provide just a number, it will default to seconds, but if you provide it as a string, it has to include an s, m, or h to indicate the granularity.

For the chat endpoint, all the parameters are the same with a few exceptions. The context, system, and prompt parameters from generate are replaced with messages. Messages is an array of message objects that can include the system prompt, the prompt from the user, examples of the output you want, and any memory or context that you want to send to the model. Each message has a role, which can be system, user, or assistant; content, which is the actual text of the message; and optionally images, which is the array of images to send for multimodal models. I'll do some more complex examples later when we get to the library. The rest of the endpoints are a bit simpler and should be understandable using the docs.
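Here's a rough sketch of what a chat request can look like, again using plain requests purely as an illustration; the model name and message contents are placeholders, and stream is set to false so a single JSON object comes back:

```python
import requests

# Minimal sketch of the chat endpoint: the messages array carries the system
# prompt, the user prompt, and any prior conversation turns.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama2",
        "stream": False,  # wait for the full answer instead of a stream
        "messages": [
            {"role": "system", "content": "You are a terse assistant."},
            {"role": "user", "content": "Why is the sky blue?"},
        ],
    },
)

# The assistant's reply comes back under message.content.
print(resp.json()["message"]["content"])
```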
So let's switch gears to working with the Python library. You can learn more about it at github.com/ollama/ollama-python. One of the benefits of the Python library is that it simplifies the switch to and from streaming: the function calls return a single object when not streaming, or a Python generator when they are streaming.

Okay, so let's get into the code. This will be part of the video projects repo, under the intro dev Python folder for 2024-04-02. The first step is to add the Ollama Python library with pip install ollama. This first example is 1.py. I'll start by importing ollama. The simplest example is just ollama.generate with model set to any model. Remember, this is going to reset the keep_alive timer for this model. But let's set the prompt to something in 2.py: why is the sky blue? And we can run it. Notice that we get the non-streaming response. The API defaults to streaming, but here we default to not streaming. I think this makes more sense, but changing the REST API would mean breaking a lot of code out there in the process, so that probably won't happen until there's a need for a new version of the API; folks are usually pretty reluctant to do that.

So add stream set to true in 3.py. But we don't get much of a response, so let's iterate through the parts of the output and write that out. That's a bit hard to watch, so rather than printing the full chunk, let's print just the token, and that looks better. We can add the stats at the end by looking for done set to true. Durations in Ollama are measured in nanoseconds, so we need to multiply it out to be useful.

In 4.py we're going to make a second generate call that remembers the first. We save the value of the context from the last blob and feed it into the context of the second call. Here I've set the prompt to be vague so that we know it was using the context. We can try again without the context set, in order to verify that it does what we think.

Okay, now let's try describing an image in 5.py. I have a JPEG of my hand in this folder. Unlike the REST API, which expects a base64 encoded string, the Python module expects the image as a bytes object; in fact, it won't work with a base64 encoded string. Then just have the model describe the image, and that worked just fine.

Let's move on to using the chat endpoint. As we saw in the API section, it's pretty much the same thing, with the exception of how messages work. We can try a simple example in 6.py. I'll just set a system prompt and then the user prompt. That means defining an array of messages, where each message includes a role and the content. Then feed the messages to the model and spit out the results.

For 7.py we'll do something a little bit more complicated, using format json, an example schema, and some example outputs. You can see the example in there: I specify that a user said Paris and the model responded with information about Paris. This helps the model understand what you need. So now when we ask about Amsterdam, we should get something interesting.

There's one more thing I want to show you. Every call we've made to Ollama has been local thus far, so http://localhost:11434 was the base of the URL. But what if you set up a server that's not your local machine? I set up a Linux box on brev.dev and called it "Matt's remote Ollama API". I installed Ollama and pulled llama2, then installed Tailscale and set the machine name in the Tailscale admin to "Matt's remote Ollama API". Then I set the OLLAMA_HOST environment variable to 0.0.0.0 and restarted Ollama. Now, in the code on my machine, change the ollama import and create a new ollama Client pointing to the remote host, and the code just works.

I think the rest of the functions should just make sense. You'll find other examples using code in this repo, which holds the code for most of my more recent videos. If anything wasn't clear in this video, let me know in the comments below. It's hard to fix the video, but if it's a glaring omission I can look at making a new video. And join us in the Discord at discord.gg/ollama. Thanks so much for being here. Goodbye.
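As a recap of the library calls walked through above, here is a minimal sketch of streaming generate with the end-of-run stats, a chat call with a messages array, and a Client pointed at a remote host. It's an illustration under stated assumptions, not the exact scripts from the repo: it assumes pip install ollama, a local Ollama service with llama2 pulled, a hypothetical remote hostname, and dict-style access to response fields (which may vary slightly between library versions):

```python
import ollama

# Streaming generate: each chunk is one blob; the last one has done=True
# and carries the generation stats (durations are in nanoseconds).
for chunk in ollama.generate(model="llama2", prompt="Why is the sky blue?", stream=True):
    print(chunk["response"], end="", flush=True)
    if chunk.get("done"):
        tokens_per_sec = chunk["eval_count"] / (chunk["eval_duration"] / 1e9)
        print(f"\n{tokens_per_sec:.1f} tokens/second")

# Chat: a system prompt plus a user prompt in the messages array.
result = ollama.chat(
    model="llama2",
    messages=[
        {"role": "system", "content": "You answer in one short sentence."},
        {"role": "user", "content": "Why is the sky blue?"},
    ],
)
print(result["message"]["content"])

# Pointing at a remote Ollama host instead of localhost (hostname is a placeholder).
client = ollama.Client(host="http://matts-remote-ollama-api:11434")
print(client.generate(model="llama2", prompt="Hello there")["response"])
```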
Info
Channel: Matt Williams
Views: 18,913
Id: _4K20tOsXK8
Length: 12min 11sec (731 seconds)
Published: Mon Apr 01 2024