6 Ways to Run a Local LLM (aka how to use HuggingFace)

Video Statistics and Information

Captions
Large language models like ChatGPT, Anthropic's Claude, or Google Bard are really powerful tools with lots of use cases. However, all these services share the same big drawback: privacy. What happens if we are dealing with sensitive data? We can't send it over the wire, whether for reasons of compliance or simply of trust, because we don't know what is going to happen to it. This is where having a private language model we can run on our own machines is really valuable. There are literally hundreds of open-source models that have already been trained and are ready to use, so join me in my quest to find a ChatGPT replacement I can run locally on my laptop.

Before we start, let's set expectations. There are hundreds of thousands of open-source models out there, some trained by individuals and some trained by corporations like Meta. To run some of these models we need a beefy machine with lots of memory and even a GPU; some smaller models can run on a laptop. But even with the best hardware, open-source models are smaller and a little less powerful than polished products like ChatGPT. After all, OpenAI has dedicated hundreds of engineers to maintaining the thing. The leaked document called "We Have No Moat, and Neither Does OpenAI" goes into great detail explaining how open source has solved problems like running language models on phones or multimodality. It also includes a comparison between closed- and open-source models: open-source models go through new iterations quickly, one or two weeks apart, and even though they are smaller they can do a very good job compared to closed-source models that may be a thousand times bigger.

So, you have probably heard about the site Hugging Face (huggingface.co).
It is the largest repository of open-source models. We can find projects and models uploaded by individuals and by corporations; you can find models from Microsoft and from Meta. This is a really busy page because it's not only about language models but also about things like image generation and other state-of-the-art AI. What we want for our use case, language models, is to go to Models and then filter by task on the left side. There is a whole lot of tasks, like audio or image processing; what we want is something under Natural Language Processing. This is the first thing we need to pay attention to, because the kind of work we can do depends on what task we choose here: selecting translation is not the same as selecting summarization. Ideally we pick the best model for our job. In my case, for example, I would like something to chat with, so I will pick Conversational, and that starts to filter the list.

The other big filter is the Libraries section, which covers the kinds of libraries and ways of running the language models. Most of these models are Transformers. Transformers is a type of model, it's the T in ChatGPT, and it's also a Python library from Hugging Face that simplifies setting up a model. We also have popular libraries like PyTorch, TensorFlow, and JAX, which is a Google library. Then there are some formats like GGUF, which is the format used by llama.cpp for Llama models, and Keras, which is another big library. We're going to pick something we know how to use, or want to learn how to use. For starters, let's pick Transformers, and then we can sort on the right side by trending or most downloads. Opening one of these models gives us information about it. Let's pick one that I know will work on my machine, because it's quite limited in memory; for example, Microsoft's DialoGPT. This is a GPT-2 model, which is a smaller and older model. Here we have the model card, which basically explains how the model works and, more importantly, how to use it, with some snippets of code you can copy and paste on your computer to run the model. If you don't find useful information there, you can always click "Use in Transformers" and you will find a couple of snippets; normally the first snippet is the one that bootstraps your program and picks the model. The Transformers library will download the model automatically to your machine and set it up, and obviously you need to complete the code with the logic you want for your project.

So let's go and pick one of these. To get started we are going to need a few tools, like a compiler, make, or CMake; these are all things I have installed on my machine, but you will probably need to install them so the libraries can compile. We start by installing PyTorch and TensorFlow, which will cover most of what we need for this example; we can use pip to install both libraries. Then we install Transformers with the sentencepiece extras, which is the complete package with all the tools you need to run Hugging Face Transformers. Finally, let's create a new file for our chat script and paste in the code from the model card.
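For reference, the model-card snippet for DialoGPT looks roughly like this. This is a minimal sketch: the prompt text and generation settings are my own, and the real model card wraps the same calls in a multi-turn chat loop.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Downloads the model on first run, then loads it from the local cache
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

# Encode one chat turn, terminated by the end-of-sequence token
input_ids = tokenizer.encode("How are you doing?" + tokenizer.eos_token, return_tensors="pt")

# Generate a reply; prompt plus reply is capped at 1000 tokens
output_ids = model.generate(input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)

# Decode only the newly generated tokens
print(tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))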
This is essentially the snippet we find on the model card for this model. The first time it runs it's going to download the model, which in this case is about 3 GB. I already ran it, so it will start faster, but you're going to see there's a little bit of boot-up time before the model runs. This is a small model, GPT-2 medium size. We have a prompt asking how it's doing, and after a warning we get a response: "I'm doing well, how about you?" So the model is running using Python, TensorFlow, and the Transformers library from Hugging Face. As you can see, to use this you really need to code the behavior yourself, because Hugging Face only provides the part that deals with the model: how to start it, how to encode and decode the data. The rest of the application you need to develop yourself.

Regarding Hugging Face Transformers, the pros are that it handles the model download automatically, we have snippets on the Hugging Face site to use the models, it's the best thing we have for experimenting and learning about machine learning, and of course you can integrate the code into your own product. For the cons, you do need a solid understanding of machine learning and natural language processing; this is something you will need to learn to use these libraries efficiently. You need to code the application: the behavior, the logic, everything is on your side to do. You also need to know how models are configured for best performance, and it's not as fast as other alternatives we'll see a little bit later, so you really need a powerful machine to run models locally.

LangChain is a framework for building language applications on top of your models. It's an ecosystem that comprises connections to models, which can be local or remote, and all kinds of middleware to augment your application. It supports vector databases and all kinds of templating to create an application with Python that uses a language model as an engine. If we go, for example, to the components section, we will find all the large language models supported by LangChain. We can use a model from Hugging Face, there are integrations with Llama and similar tools, and we can even integrate with OpenAI, though that's not what we want, because we want something that runs on our hardware. If you want to run a model with LangChain you will still need the Transformers library and PyTorch or TensorFlow, depending on the model. You will also need to install Hugging Face Hub, a tool that's useful for managing models, and of course we have to install LangChain itself.

Now let's create the same example we did with Transformers, but this time using LangChain for the chat, and paste the code. We are using HuggingFacePipeline, which is part of the LangChain project, wrapping the pipeline from Hugging Face with the same model as before, DialoGPT-medium, and then we have the prompts. This is all in the documentation, but basically we use what we already had from Hugging Face Transformers and supply a question. Let's see if it works.
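A minimal sketch of that LangChain script might look like the following. The exact import paths and the max_new_tokens value are assumptions on my part, since they vary between LangChain versions; check the documentation for the release you install.

from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate

# Wrap a local Hugging Face pipeline around the same model as before
llm = HuggingFacePipeline.from_model_id(
    model_id="microsoft/DialoGPT-medium",
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 64},
)

# A simple template; the question is filled in at run time
prompt = PromptTemplate.from_template("Question: {question}\nAnswer:")
chain = prompt | llm
print(chain.invoke({"question": "What is encephalography?"}))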
Note that this is not a chat: it's just going to reply to my question, which is "what is encephalography?" After about 10 or 20 seconds we have the answer, and it's working, so we can use this to build our chat application. LangChain successfully accessed the model from Hugging Face, which means it's working. And this is just the start: obviously we can use a lot more of the tools LangChain provides to build a very complex application with lots of features, but this is the basis for running a model on LangChain.

Going over LangChain, we can sum up the pros and cons in one list. The pros are that it's easy to use because it has an ecosystem, you can run local and remote models side by side, and while you need to be a developer, you don't need to be a machine learning specialist like with Transformers, so you can focus on the logic of the application. You still need to develop the application, though. The cons are the same as with Transformers: it's just as slow compared to other alternatives we'll see later, and it's still running on Python, so you still need a good machine to run a model locally.

The third way of running a local LLM is with llama.cpp. This is a port of the Llama models from Facebook and Meta to plain C/C++, and we are going to see that it is much more performant: it runs bigger models on smaller hardware. I'm able to run a bigger model on this laptop, something I couldn't do with Hugging Face or LangChain. It's basically a very optimized implementation of a model, and it has very good support for Apple Silicon, so it's ideal for me and for everyone using M1 or M2 chips. It's a good way of running a local LLM using fewer resources.
To use llama.cpp we basically need to clone the repository, and once it's cloned we can build the project with make. Now we can execute main, and we see that it doesn't run because we don't have a model. llama.cpp uses a special format designed for this project, GGUF. This is one way of storing a model, and one of the most modern: compared with, for example, PyTorch or TensorFlow, which use bin or H5 files, this format is better. So we are going to go to Hugging Face, locate one of the models in this format, download it, and then test llama.cpp.

Back on Hugging Face, let's pick a Llama 2 file; there are a few. If you want the official Llama files from Meta, you need to create an account on Hugging Face, the email must match the one you used with Meta, and you need to request access, which takes a couple of days. If we don't want to do that, there are also a few models uploaded by individuals. We can pick, for example, this Llama 2 model with 7 billion parameters in GGUF. You can see the model files under "Files and versions"; there are different types, and the Q comes from quantization, so it depends on how many bits the model uses. For testing we can download the smallest quantization. It will take a while, but once we have it we can execute it with llama.cpp.

Now that I have the model downloaded, I can use the -m flag and point to the file I have in my temp folder. It starts, and since I didn't supply any prompt, it just starts spouting random text, so let me cancel that. Let's start again, and now we're going to use the -p parameter, which initializes the model with a prompt, and we should hopefully start getting some information. There are a lot more options here; we could create a template, which would be the best way to interact with llama.cpp, but it seems it's generating some output, so let's see what we get. We got an answer, and for reference this is a model that wouldn't run with PyTorch or TensorFlow, so llama.cpp really is more performant. At the end we also get some benchmarks about timing, which is very interesting, and it allows me to run things I normally wouldn't be able to.

Going over llama.cpp, the pros are that it's a lot more performant; you saw that we can run models I couldn't have run with Python, thanks to it being pure C/C++ with a lot of optimizations, and it can run bigger models with less hardware. I can run them on the command line or as a browser application, and there's a good number of options to tweak the behavior of the models. If you want to build an application you can still do that, because it has bindings for various languages, so you can have your logic in, for example, JavaScript and Node.js while the model actually runs with llama.cpp.
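For reference, the whole llama.cpp sequence described above boils down to a few commands. This is a rough sketch: the model filename is just an example of a quantized GGUF file, not necessarily the exact one used here.

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# Run with a downloaded GGUF model, an initial prompt, and a token limit
./main -m /tmp/llama-2-7b.Q2_K.gguf -p "What is encephalography?" -n 256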
For the cons, it doesn't support all the models on Hugging Face, only Llama models and a subset of other supported architectures. You have to build the tool yourself, so you need some knowledge and some tools installed, and it's not as user friendly as some tools we'll see later on.

While we're talking about llama.cpp, there's another related project, from Mozilla, called Llamafile. Basically this is one portable file you can run anywhere: Linux, macOS, or even Windows. You download this file, download a model, and run it without having to compile anything. You can even embed the model in an executable file that runs anywhere, so you can share your models easily and other people can run them just by double-clicking the file, which is great. To use this project we have a few options. We can download a file with a model bundled in, and it will just work, but since I already have a model downloaded and don't want to download it again, I can use the project with external weights. I will download the binary, which works on both macOS and Linux (there's a separate one for Windows), and make it executable. Running it throws an error because it's looking for a model we don't have, but we can always pass -m and supply the model we already have from the last example, and it's going to load it. In this case it's not a command-line application: it opens a browser where we can interact with the model.

This is the starting page; as you can see, it's just llama.cpp's web interface, and we have a few options to tweak the model. But let's simply talk to the model and see what happens. It starts producing text; let's ask the same thing we asked before, and since the model is the same it should provide a very similar answer. The model works at more or less the same speed, but in this case we didn't have to build or clone a repository, just download one file and one model. You can see the answer is shorter because of the options we supplied when we started the model. We also have the option to upload images and ask the model about them; when the model is multimodal, that should work. So I think this is an easier alternative to running llama.cpp: you get the same benefits with an easier install.

Going over Llamafile, we can sum up the pros and cons in one list. The pros are the same as llama.cpp: it's fast and you can run big models on a small machine. The other advantage is that you can build a single executable file you can share with other developers or publish, and it will run on any machine. The cons are the same as llama.cpp: you don't have all models available, only the ones llama.cpp supports.
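The Llamafile flow with external weights looks roughly like this. This is a sketch: the release URL, version number, and model filename are placeholders of my own, so check the project's releases page for the current ones.

# Download the Llamafile binary, make it executable, and point it at a local GGUF model
curl -L -o llamafile https://github.com/Mozilla-Ocho/llamafile/releases/download/0.6/llamafile-0.6
chmod +x llamafile
./llamafile -m /tmp/llama-2-7b.Q2_K.gguf   # opens the chat UI in your browser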
And there's yet another Llama-related project here: Ollama (ollama.ai). This site gives you an installer for Linux or macOS, and once you download and install it on your machine, it installs a command-line tool and downloads Llama models directly, so I think this is the easiest way to run a Llama model: you just double-click, and then run your query in the terminal. Let's see how it works. With Ollama we download the tool, install it, and then pick one of the models listed on the site, for example Llama 2. Once we install Ollama we get a service in the taskbar and a command on the command line. The first thing we need is to pull a model; we saw that there's a llama2 model available, so that's the first step. Next we run ollama run with the model name, and it starts a command-line session where we can talk with the model. Let's ask the model the same thing we've asked so far to see if we get a good answer. We get a pretty similar answer; it's a little bit slower than llama.cpp or Llamafile, but it's really easy to get the installation going and to download models.

With Ollama, I think this is the easiest way to install and use a Llama model: it's just double-clicking, it installs a service on your machine, and then you open a terminal, select your model, run your prompts, and get results. You can run Llama and Vicuna models very easily, and it runs really fast. For the cons, you do have a limited set of models; there are fewer models than with llama.cpp, though maybe that changes in the future. The other thing is that you can't manage the models yourself: the tool manages downloading them, so if you already have models downloaded, you may need to download them again. There aren't a lot of options to tweak; this is really a tool dedicated to users who are interested in using the language model, not in tweaking the options. And there's no Windows version, so if you're on Windows this won't work for you.
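The terminal part described above comes down to two commands; the model tag here is the one shown for Llama 2 on the Ollama site at the time.

# Download the model, then start an interactive chat session with it
ollama pull llama2
ollama run llama2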
The last product we are going to see today is GPT4All. This is the most user-friendly option because it includes a GUI: it allows you to simply install the application, select one model (it downloads it automatically), and run your queries. You can even add your own documents to the application, and it will go and index them; once the documents are indexed you can query them, talk to your documents, and ask the model to summarize or translate them. This is a really good way of working with documents locally, without sending them over the wire, keeping them private and safe.

To get started we download the installer for our system. Once we install the application and start it, we get a window very similar to ChatGPT. The first thing is to go to the downloads section and select one of the available models. I already installed Mistral OpenOrca, but I could download any of the others; we can see the sizes, and we can also connect to OpenAI, but let's use the model I already downloaded. Another useful thing here is the options panel, where we can change the prompt for the model, the temperature, and all the other parameters to tweak it. One feature this application has is that you can add a path, a folder; for example, I use my documents folder here, and it will index it. Once it's indexed, I can simply enable this folder, and now the model has access to all my files in that directory. For example, this directory has a lot of books on CI/CD, and we can ask the model to give us a definition based on the files it finds in my docs, in my ebooks directory. As you can see, we get an answer, and it even provides the context: it identifies the book and summarizes a definition from my document. It's really a powerful feature. Of course, we can use the model normally like any other model, asking questions and everything, but with documents we have more alternatives.

For GPT4All, the pros are that it has a polished GUI and is the most user-friendly option we've seen today; we can use local models and remote online models like OpenAI; it handles 30-billion-parameter models easily on my machine, so it's performing great; and you can add your own documents for context, so you can keep your data safe on your machine. For the cons, I would say there's a very limited number of models, and some of the models can't be used commercially.

So the choice of how to run a large language model will depend on your needs. If you want a ChatGPT-like experience, you can use GPT4All, which is one of the best ones around. If you want to build a project on top of a language model, you will need something like llama.cpp or even Python. If your goal is learning machine learning, you need to learn Python and use the machine learning libraries. I hope I made a case for running your own large language model. The models are only getting better, and I think the gap between closed- and open-source models is getting smaller all the time. And there's especially a case for using local language models because you are in total control of your own data. That's all for now. I really do want to explore AI in more detail in the future, so if you have any requests feel free to leave a comment. If you liked this video, please leave a like and subscribe so you don't miss the next one. Thank you for watching and have a nice one.
Info
Channel: Semaphore CI
Views: 15,971
Id: 7jMIsmwocpM
Length: 28min 20sec (1700 seconds)
Published: Thu Dec 14 2023