Unlimited AI Agents running locally with Ollama & AnythingLLM

Captions
Hey everyone, my name is Timothy Carambat, founder of Mintplex Labs and creator and maintainer of AnythingLLM. Today I'm going to showcase AnythingLLM, just kind of how it works, but also show you something that makes Ollama models really powerful: we're actually going to give agent capabilities to any LLM available on Ollama, where you can then search the web, save things to memory, scrape websites, even make charts, whatever you want. And I'm going to show you how to unlock all of those abilities by just downloading AnythingLLM and connecting it to Ollama. It'll be really simple.

But first I want to give a little bit of education about what Ollama is, what quantization is, and what agents even are. First, Ollama. If you found this video, you've definitely heard of Ollama, because Ollama is in the title. Ollama is an application you can install for Mac, Windows, and Linux, and it allows you to run LLMs using your own computer's hardware, no cloud, no anything like that, so it's totally private. The way that this is possible, given that Llama 3 is a massive model that takes dozens of GPUs to run, is through a process called quantization. Quantization is basically how we get these models small enough to run on your CPU or GPU. I'm not going to get into the weeds of how that works; in general you should know that quantization is basically compression of an LLM, and when we get into agents I'll tell you why that's really important.

The next part of this really short lecture is: what is an agent? You have LLMs, and they respond to you with text, right? They don't really do anything. An agent does something. It's an LLM that is able to execute what people call tools, or skills (there's a whole bunch of language for it), but it's an LLM that, given your input, doesn't just respond with text. It actually goes and runs some program or interface or API, gets that information, does that action, and then comes back to you with the result: your question answered with that tool's supplemented help. It's like RAG, but instead of just chatting with a chunk of a document, you're actually doing things.
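The loop the speaker describes (the model either answers with text or asks for a tool to be run, then answers using the tool's result) can be sketched roughly like this. The tool name and the stand-in model are invented for illustration; a real agent would call an actual LLM at the `fake_model` step:

```python
# A minimal sketch of the "agent loop" described above: the model either
# answers directly with text, or asks for a tool to be run and then gets
# the tool's result fed back into the conversation. The model here is a
# stub, not a real LLM; the tool name is invented for illustration.

def web_search(query: str) -> str:
    # Stand-in for a real search tool.
    return f"Top result for {query!r}: AnythingLLM is an all-in-one AI app."

TOOLS = {"web-search": web_search}

def fake_model(messages):
    # A real agent would call the LLM here. This stub issues one tool
    # call, then answers using the tool output appended to the history.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "web-search", "args": {"query": "what is AnythingLLM"}}
    return {"text": "Answer based on: " + messages[-1]["content"]}

def run_agent(user_input: str, model=fake_model) -> str:
    messages = [{"role": "user", "content": user_input}]
    while True:
        step = model(messages)
        if "text" in step:                    # the model answered: done
            return step["text"]
        tool_fn = TOOLS[step["tool"]]         # the model asked for a tool
        result = tool_fn(**step["args"])
        messages.append({"role": "tool", "content": result})

print(run_agent("what is AnythingLLM?"))
```

The key design point is that the loop, not the model, actually executes the tool; the model only emits a structured request for it.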
And you can see that RAG actually sits on the top part of this graph, short-term and long-term memory, which is a common use case for retrieval-augmented generation: chat with your docs, all the same thing. What we're going to do is get this working for any LLM. You're probably familiar with cloud-based models like OpenAI's, or Anthropic's Claude, or Perplexity, where you can say things to the model and sometimes it can go and do something, like search the web, which is a very common use case. However, if you are using Ollama and you try to tell your model to search the web, it'll just tell you that it can't do that. Well, now, with AnythingLLM, any LLM can be an agent, and can even search the web, and do all of this for free on your computer with 100% privacy. So I'm going to show you how we're going to unlock that today.

The first thing we need to do is find a good model. As I said, any LLM will work with AnythingLLM and its agent capabilities. However, coming back to quantization, there's one detail that people tend to overlook when it comes to Ollama: by default, Ollama will install a Q4 quantization. Now, that probably doesn't mean much to you, but here's the rule of thumb: Q1 is the most compressed version of a model; Q8 is the least compressed version, but still compressed, not the raw model. If you have a model that is 8 billion parameters and you compress it a lot, to like Q2 or Q3, you basically took something that's already small and then compressed it heavily, so now you have a pretty bad model. You'll get hallucinations, you'll get weird outputs, it'll just go crazy and not even respond to your questions. All of these become problems as smaller models get quantized very heavily. So what we're going to do today is intentionally download Llama 3 from Ollama, but use the Q8 version, so that it is more robust, the tool calls are more reliable, and the responses are just better.
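The Q2-versus-Q8 trade-off can be made concrete with a toy experiment: uniformly quantizing the same "weights" to 8 bits loses far less information than quantizing them to 2 bits. Real schemes like the GGUF q4_0/q8_0 formats are block-wise and more sophisticated; this sketch only shows the underlying idea:

```python
# A rough illustration of why heavier quantization hurts: round-trip the
# same random "weights" through an 8-bit and a 2-bit uniform quantizer
# and compare the reconstruction error. (Real q4/q8 formats are
# block-wise and cleverer; this is only the core idea.)
import random

def quantize_roundtrip(weights, bits):
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels
    # store each weight as a small integer code, then reconstruct it
    codes = [round((w - lo) / scale) for w in weights]
    return [lo + c * scale for c in codes]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]
err8 = mean_abs_error(weights, quantize_roundtrip(weights, 8))
err2 = mean_abs_error(weights, quantize_roundtrip(weights, 2))
print(f"8-bit error: {err8:.4f}   2-bit error: {err2:.4f}")
```

The 2-bit error comes out orders of magnitude larger, which is the numerical face of the "hallucinations and weird outputs" the speaker warns about.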
If we were messing with the 70-billion-parameter model, yeah, we probably wouldn't download the Q8; it'd be like 30 gigs. We'd use the Q4 and have a good time, because 70 billion parameters is a lot. I know that sounds very technical, but hopefully you understand why quantization and picking the right model is a use-case science, and it's something you should understand if you're messing with LLMs at all.

If you go to Ollama, go to Llama 3, and scroll down, you'll see that the 8b tag and the latest tag (which is what downloads by default) are the same, and this tag is also matched to the instruct model, which is the same, and it is a Q4. So this is a pretty small model, basically the middle of the road between size and performance. But we want really good performance, because we're dealing with agents, so I'm going to go find the Q8 version of this model, which you can do by just typing in q8, and you'll see that it's right here: it's eight and a half gigs.

I'm running on an Intel MacBook Pro, which is pretty bad for inferencing in general. I have a Windows computer in the other room, so I'm actually going to run Ollama on that computer and AnythingLLM on this computer, all on my private network. So here I am on my Windows computer, and I have Ollama installed; if I type in ollama, we have it running. I need to pull in that Q8 model, and the easiest way to do that is ollama pull. I already have this downloaded, because I wasn't going to wait while making this video, and you'll see it downloads all of the layers, and we're good to go. The only thing left is ollama serve, to make sure that the server is running; the server is already running. And as you can see, I have ngrok running, and I'm tunneling from my desktop computer in one room to my MacBook in the other room.

This is where we can get into AnythingLLM. AnythingLLM is an all-in-one AI agent and RAG tool that runs on your desktop fully locally, connects with pretty much anything you care about, and works on Mac, Windows, and Linux.
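Under the hood, `ollama serve` exposes a small HTTP API (by default on port 11434), and a tunnel like ngrok simply forwards that port to another machine, which is what lets the MacBook talk to the Windows box. A minimal stdlib-only sketch of building such a request follows; the base URL is a placeholder for your own host or tunnel, and the request is constructed but not sent:

```python
# Minimal sketch of talking to the HTTP API that `ollama serve` exposes
# (port 11434 by default). A tunnel such as ngrok simply forwards this
# port to another machine. The base URL below is a placeholder.
import json
import urllib.request

def build_generate_request(base_url: str, model: str, prompt: str):
    """Build (but don't send) a POST to Ollama's /api/generate endpoint."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,   # ask for one JSON response instead of a stream
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_generate_request(
    "http://localhost:11434",          # or your ngrok tunnel URL
    "llama3:8b-instruct-q8_0",
    "Say hello",
)
print(req.full_url)
# Actually sending it would be urllib.request.urlopen(req), which of
# course needs a running Ollama server on the other end.
```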
All you do is go to useanything.com, download it, and click on the proper operating system and chip architecture. Since I have AnythingLLM downloaded, we're going to boot it up, and because I've never run it before on this computer, it is basically going to ask us what LLM we want to use; that should be the first question. So here we are in onboarding, and it asks what LLM we want. AnythingLLM actually ships with Ollama inside of it, so the whole setup of Ollama on my Windows computer is completely extraneous if you have a GPU device. I am on an Intel MacBook, which is really old, so I'm going to use the Ollama external connection, and all I'm going to do is paste in that address from ngrok, and you'll see that my chat models are loaded. I want to use the Q8, and I know, because I know about this model, that it has an 8,192-token context window. It's really annoying that they don't publish this information for every model; you have to go and Google it. But anyway, we'll just continue.

You can see this is kind of a privacy overview: we're going to use AnythingLLM's built-in embedder, so everything will embed on this device, and we're going to use the built-in vector database as well, so that basically none of my chats leave my local network at all. All of my data is going to stay on premises, and it'll all just work very nicely. And of course you can skip the survey; it's totally optional. Let's make a workspace, and we're going to just call it "sample" for now.

The very first thing people would want to do is just test whether the model works, so let's just say hello. What this is doing is sending a request to my Windows computer, and Ollama on that computer is going to stream it back, and you can see it works, about as well as you would expect, and it's fast. However, while it might be fast, because I'm using a 4090 in the other room, it's still pretty dumb, and the reason we can say that is because it doesn't know anything about what I might want it to know about.
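That 8,192-token context window matters in practice: the system prompt, the retrieved document chunks, and the model's reply all have to share it. A crude way to sanity-check a prompt's size is the rough four-characters-per-token heuristic; real tokenizers vary, so this is only an estimate:

```python
# Crude context-window budgeting for a model with an 8,192-token window.
# The 4-characters-per-token rule is only a rough heuristic; the model's
# own tokenizer gives exact counts.
CONTEXT_WINDOW = 8192

def rough_token_count(text: str) -> int:
    return max(1, len(text) // 4)

def fits(system_prompt, chunks, question, reply_budget=1024):
    """Estimate whether prompt + retrieved chunks + reply fit the window."""
    used = sum(rough_token_count(t) for t in [system_prompt, question, *chunks])
    return used + reply_budget <= CONTEXT_WINDOW

chunks = ["AnythingLLM is a full-stack application... " * 50]
print(fits("You are a helpful assistant.", chunks, "What is AnythingLLM?"))
```

This is also why the snippet count in a workspace's vector-database settings (seen later in the video) can't be cranked up arbitrarily: every extra chunk eats into the same budget.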
For example, AnythingLLM: while people love it, and it's great, and it's cool, it's not popular enough for an LLM to know about it. So if we ask the question "what is AnythingLLM," it's likely going to make something up, and it's going to say that AnythingLLM is an LLM, which is totally wrong. This is all a hallucination; none of it is accurate. But what can we do to improve its ability to know about AnythingLLM? Well, the easiest way is RAG, so let's do that first. We're going to upload a document. I actually have AnythingLLM's GitHub readme already downloaded as a PDF, so I'm just going to upload that and then move it over to the workspace, so that when I'm chatting with Ollama in this workspace, it will use this set of documents. You can see that it was downloaded successfully, so we can close this window.

Now let's reset the chat and ask that same question again: what is AnythingLLM? What we would hope to see is a response back (wow, that was quick) with citations, so we can see exactly which chunks were relevant to my query and resulted in the LLM being able to complete this. It says AnythingLLM is a full-stack application, blah blah blah, does all this stuff. That is accurate; this is actually factual information. We can go into the workspace settings, go to the vector database, increase the number of snippets per chat, and change the way documents are deemed relevant. But there's actually an easier way to just use LLMs, and that is with agents. As I said before, this is not a capability built into Ollama, and it's not a capability built into Llama 3. This is something we have been able to apply to any LLM that doesn't support function calling. Function calling is how all of this magic works, and now you can unlock it when you use AnythingLLM with any LLM.

So, in the agent settings, we want to use Ollama; we have Ollama, and we have our model. That's it, really.
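The retrieval step behind those citations can be sketched in miniature: split the document into chunks, score each chunk against the query, and hand only the best ones to the LLM as context. AnythingLLM really uses a learned embedding model and a vector database; the bag-of-words cosine similarity below is a toy stand-in that just keeps the idea visible:

```python
# A toy sketch of the retrieval step in RAG: split a document into
# chunks, score each chunk against the query, and return the top ones.
# Real systems use learned embeddings and a vector database; bag-of-words
# cosine similarity here is only a stand-in for illustration.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_chunks(document: str, query: str, chunk_size: int = 20, k: int = 1):
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]
    qv = Counter(query.lower().split())
    ranked = sorted(chunks, reverse=True,
                    key=lambda c: cosine(Counter(c.lower().split()), qv))
    return ranked[:k]

doc = ("AnythingLLM is a full-stack application for chatting with documents. "
       "It supports many LLM providers. "
       + "Unrelated filler text about the weather. " * 10)
print(top_chunks(doc, "what is AnythingLLM"))
```

Only the top-k chunks ever reach the model, which is exactly why (as the speaker notes later) a vector database alone cannot "summarize the whole document."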
We don't want to use a worse model; I think we have Llama 3, but let's stick with the Q8 version. And there are some default skills that exist. Of course, RAG and long-term memory: we already saw that; it's built into AnythingLLM. We should have the ability to look at the documents in our workspace, modify them, summarize them, and commit new information to long-term memory just from chatting. We should be able to summarize documents. We should be able to scrape websites; that's a feature built right into AnythingLLM. We can generate charts; I'll admit this one is a little model-dependent. You could paste in a CSV and say "make a bar chart"; some models kill it, and Llama 3, honestly, isn't that great at it. Generate and save files to the browser: if we're talking to it and we say, "hey, can you save that contact information to tim.txt," it'll download it and save it on your desktop, on this device. And then, of course, live web search and browsing. This makes any LLM that you download and run locally basically on par with Perplexity, and you can actually do it for free.

I'm sure you're thinking, "ah, but I need an API key." You do, but Google actually offers this service totally for free. You can just click on the link we provide, and it opens up the Programmable Search Engine setup. You get 100 queries a day, which is honestly pretty good. We do support other search-results providers, but this one is totally free, and anybody with a Google account can sign up. So let's connect mine so we can get web browsing. OK, I have that information put in; I'm going to click update, and now everything is saved. Let's go back to the chat window.

Now, keep in mind we had information about AnythingLLM already stored in here, so let's remove it; we're just going to remove that right now. If we reset the chat and say "what is AnythingLLM," we should again get a made-up response that has nothing to do with the actual tool.
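The free web-search tier mentioned here is Google's Programmable Search Engine, whose Custom Search JSON API allows 100 queries per day at no cost. A sketch of the request such a search skill would make is below; the key and engine ID are placeholders, and the code only builds the URL rather than sending it:

```python
# Sketch of the free web-search call behind this feature: Google's
# Programmable Search Engine is queried through the Custom Search JSON
# API (100 queries/day on the free tier). The key and engine id are
# placeholders; this builds the URL but does not send the request.
from urllib.parse import urlencode

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

def build_search_url(api_key: str, engine_id: str,
                     query: str, num: int = 5) -> str:
    params = {"key": api_key, "cx": engine_id, "q": query, "num": num}
    return f"{API_ENDPOINT}?{urlencode(params)}"

url = build_search_url("YOUR_API_KEY", "YOUR_ENGINE_ID", "what is AnythingLLM")
print(url)
# Fetching this URL returns JSON whose "items" array carries the result
# titles, links, and snippets that an agent can feed back to the LLM.
```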
We can actually get an agent into the loop on this, and the way you do that is by typing @agent (or you can click this, and we tell you about how agents work). So you would say: "@agent, can you scrape useanything.com, which is our website, and tell me the key features." What we should hope to see is this model go to useanything.com, scrape it, compile all that information, specifically the key features, and hopefully give us back a pretty good text response. And you can see that we actually get what I would consider a pretty decent response.

But keep in mind, this is not in long-term memory, so let's ask the model to remember it for later: "thank you, can you remember that information for later?" What we should hope to see is the model recognize this as an available function and say, "oh yes, of course, I will take the chat as it is right now, summarize it, and save that for later," so that when we ask in regular chat, it would work. And you can see that it's done that.

Now let's look at summarization. Summarization is one of the most requested and used features of AnythingLLM, and it's not how RAG works. It's actually a pretty big misunderstanding: people think you can just upload a document to a vector database and say "summarize my document." That's just not how vector databases work. But with AnythingLLM, you can do it. So I'm going to open up a new workspace (we'll just call it "anythingllm") and upload that same readme document. Because I've already embedded it in another workspace, embedding is instant. And now, with no other kind of inferencing or leading, let's just ask the agent: "can you summarize readme.pdf," which is the name of the file in the workspace. You can see it looks at the available documents, finds a document called readme.pdf, and then begins to summarize it. Again, this is all running locally within my network, because I'm using my Windows computer.
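The reason "summarize my document" needs an agent rather than plain RAG is that retrieval only ever returns the top-k chunks, never the whole file. One common way a summarization skill can work (a hedged sketch, not necessarily AnythingLLM's exact implementation; the `summarize` stub stands in for real LLM calls) is map-reduce: summarize each chunk, then summarize the summaries.

```python
# Map-reduce summarization sketch: summarize every chunk of the document
# ("map"), then summarize the concatenated partial summaries ("reduce").
# The summarize() stub below just truncates; a real system would make an
# LLM call at each of these points.

def summarize(text: str, limit: int = 60) -> str:
    # Stand-in for an LLM summarization call.
    return text[:limit].rstrip() + ("..." if len(text) > limit else "")

def summarize_document(document: str, chunk_size: int = 200) -> str:
    chunks = [document[i:i + chunk_size]
              for i in range(0, len(document), chunk_size)]
    partial = [summarize(c) for c in chunks]           # "map" step
    return summarize(" ".join(partial), limit=200)     # "reduce" step

doc = "AnythingLLM is a full-stack application. " * 40
print(summarize_document(doc))
```

Unlike retrieval, this touches every chunk of the document exactly once, which is what makes a faithful whole-document summary possible within a bounded context window.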
It is summarizing, and you can see that it says it summarized it, blah blah blah, did all this stuff, and mentions it's MIT licensed. That is the quick preview of what agents can do for any LLM when you put them in AnythingLLM. And while I do recognize that this list of default skills is pretty limited right now, I want to really emphasize that this is just the beginning for AnythingLLM. We're actually going to add the ability for you to define your own agents, like you would in tools like CrewAI and other agent builders already out there; that'll just exist in AnythingLLM. AnythingLLM plus Ollama can be your go-to for not only RAG but also AI agents that can do things for you. We have a lot more cooking on this front, so I'm really excited to show you this even in its current state.

I also want to remind everybody that AnythingLLM is open source. You can use the app I just showed you, right now, today, for free, with no ifs, ands, or buts; you just download it and get it running. The easiest way to support us is by starring us on GitHub; we would really appreciate that. Even more so, I'd appreciate feedback and suggestions on new tools that you would like to see agents accomplish. We'd love to know what you're working on and how AnythingLLM fits into that flow. So that's it for this short video. I really appreciate your time. Thank you.
Info
Channel: Tim Carambat
Views: 87,604
Keywords: ollama, ai agents, ai, lmstudio, local agents, crewai, superagent, llm, localllm, privategpt, llama3, llama3 agents, function calling oss, function call, ai tools, desktop app, ai desktop app
Id: 4UFrVvy7VlA
Length: 15min 20sec (920 seconds)
Published: Thu May 09 2024