Llama-2 with LocalGPT: Chat with YOUR Documents

Video Statistics and Information

Captions
In this video I'll show you how you can use the newly released Llama 2 models within the localGPT project, which lets you chat with your documents on your local device using GPT-style models; no data leaves your device and everything is 100% private and secure. I'll walk you through a step-by-step process for using Llama 2 models to chat with your own datasets, and I'll also cover some of the updates that have been made to the project since the initial release. But before all that, I would like to thank all the contributors who made this a successful project. Thanks to everyone, we have over 10,000 stars on GitHub, and it was trending on GitHub for quite some time. If you're interested in how the code works, I recommend watching the original video; in this video I want to show you how to set this up on your local machine, use a Llama 2 model as our LLM, and highlight some of the changes.

First and foremost, we need to clone the repo, so simply click on the green button and copy the link. To use this, you need to have git as well as Python installed on your local machine. I've opened a new terminal window, and now let's go to the Github folder, because that's where I install all my projects, so I'm going to cd into Documents, then Github. Now we need to clone the repo, so we type git clone and provide the repo link, and after that I usually create a separate folder, so let me call this one localGPT_llama2. That clones everything, and now we can cd into the folder: type cd, then localGPT_llama2. As you can see, I have a whole bunch of localGPT folders; these are different things I keep testing. Typing ls shows the contents of the folder, which so far is just the cloned repo.

Next we want to create a virtual environment, so I'm going to run conda create -n followed by the environment name; let's call it localGPT_llama2. I also want to use a specific version of Python, in this case Python 3.10.0. Hit enter and it will start setting up the conda environment for us; simply proceed with the installation (the default answer is yes, so hitting enter continues). Once that's done, we've created our virtual environment, so let me copy the activation command, and I'm going to open the code in a Visual Studio Code instance to show you how to start using Llama 2 models within the localGPT project.

Here I've opened a new Visual Studio Code instance. There are a lot more files in here compared to the original code base that I covered in my initial video; most of the changes are related to formatting and pull requests, but there are a couple of major ones. For example, we now have a localGPT API, and you'll also notice a new folder called localGPT UI, so we have a graphical user interface as well. I'll cover those either later in this video or in a dedicated video, but today I want to focus on the run_localGPT.py file.

First we need to open a new terminal, so click on the toggle panel and a new terminal opens. Since we created a new virtual environment, I'm going to activate it first: copy the activation command, paste it into the Visual Studio Code terminal, and hit enter, and you can see we are now using the localGPT_llama2 environment. After that we need to install all the required packages, so I'll type python -m pip install -r requirements.txt; the python -m prefix just makes sure we're using the Python associated with the virtual environment. Keep in mind I'm running this on an M2, so the installation might look a little different than on an NVIDIA GPU. Just wait for it to finish, and then we're done with the installation.
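Put together, the setup described so far looks roughly like this; the folder and environment name localGPT_llama2 are just what I'm using in this walkthrough, and the clone URL is an assumption here, so copy the exact one from the repo's green Code button:

    cd ~/Documents/Github
    # clone URL assumed; copy it from the green Code button on the repo page
    git clone https://github.com/PromtEngineer/localGPT.git localGPT_llama2
    cd localGPT_llama2
    conda create -n localGPT_llama2 python=3.10.0
    conda activate localGPT_llama2
    python -m pip install -r requirements.txt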
Okay, so let's first talk about what is different in this version compared to the original, and then I'll show you how to use Llama 2 models as the LLM. There are basically two stages to running this project. The first one is ingest.py: the idea behind this file is that if you have your own documents or code base, you create embeddings out of them and store those in a vector store, and that's what ingest.py does. For the sake of time I'm not going to go into the implementation details, but there are a couple of major changes to ingest.py. First, it now supports multithreading, so document ingestion is a lot faster than it was before. Second, initially it only supported NVIDIA GPUs, but now you can run it on a CPU, or if you have Apple silicon you can use MPS as well; these are different options you can pass in, and I'm going to show you how to run the updated version.

To explain how it works: first you put all your documents in the source_documents folder; as an example, we have constitution.pdf in there. If you want to ingest it, that is, create embeddings for the document, you simply type python ingest.py. If you have an NVIDIA GPU and a good local setup, you don't need to do anything else, just hit enter and it will create the embeddings and a vector store for you. However, if you're running this on a CPU or on Apple silicon, you need to pass the extra --device_type flag and specify the device you have, for example mps or cpu; I haven't seen a really big performance difference between the two. Type that in and hit enter, and it will use the Instructor embeddings, which are the default embeddings for this project, to create embeddings for the file and build a vector store on top of them. The project supports multiple different file types, and you can have more than one file in the folder, so that's not an issue at all.

I'm getting a warning that the installed version of bitsandbytes was compiled without GPU support; I don't have a GPU, so I don't really need to worry about that. With the default parameters I get a total of 72 chunks, and you'll see it created this DB folder, which contains our vector store index. All good so far.
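So the three ways to kick off ingestion described above look like this as commands; pick the --device_type value that matches your hardware:

    # Default: NVIDIA GPU (CUDA)
    python ingest.py

    # Apple silicon
    python ingest.py --device_type mps

    # Plain CPU
    python ingest.py --device_type cpu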
Now let's go to the run_localGPT.py file. This is basically where you run your LLM and ask questions of your documents, or chat with them, and I think it will be helpful to talk about some of the changes in this file. Just like ingest.py, you now have the option to choose which device you want to use: if you have a GPU you can set it to cuda, which is the default, or you can use cpu or mps. Originally, when I put this project together, it only supported models in the Hugging Face format; now, thanks to community help, we have extended support to almost all the models available on Hugging Face. If you want to use a quantized model and you have an NVIDIA GPU, you can now use GPTQ-format models, and the project also supports GGML-format files thanks to llama.cpp. Again, I'm really thankful to all the open-source contributors to this project.

Let me quickly walk you through how you set up a new LLM. To use the LLM of your choice, you need to set two things: the model ID and the model base name. If you're using a full, unquantized model in the Hugging Face format, you only need to set the model ID and make sure the model base name is set to None. However, if you're using a quantized model, for example a GPTQ-format model, then you need to define the model ID as well as the model base name. Let me show you where to get this information. Go to Hugging Face; we're going to use models from TheBloke. Say you want a GPTQ-quantized model: click on it, and the model ID is simply the username plus the model name, which you can copy from the top of the page. For the model base name, go to Files and versions and look for the file ending in .safetensors; that is your model base name. Keep in mind that to run GPTQ-format models you need an NVIDIA GPU. If you want to run models on the CPU and still use a quantized version, look for the GGML format instead; I believe GGML now also supports GPU inference through llama.cpp, so you can use those directly as well. In that case the model ID works the same way, and for the model base name you have a whole bunch of options because there are different quantization variants, for example a 4-bit quantized model or a 3-bit quantized model, each with a different size. Depending on your hardware, you simply copy the file name of the specific quantized variant, and that becomes your model base name.

Once you have both the model ID and the model base name, you come back and populate them in run_localGPT.py. For example, I'm using TheBloke's Llama 2 7B chat GGML model here, with the 4-bit quantized version; that's how you set up different models. Right now you set this inside run_localGPT.py, in the main function, but in the near future I think I'm going to move it somewhere else to make it a little better organized.
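For reference, the three setups described above might look something like this; the constant names and the exact repo and file names below are illustrative, so copy the real strings from the model card's Files and versions tab:

    # Option 1: full, unquantized Hugging Face model -- base name stays None
    # MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"
    # MODEL_BASENAME = None

    # Option 2: GPTQ-quantized model (needs an NVIDIA GPU)
    # MODEL_ID = "TheBloke/Llama-2-7B-Chat-GPTQ"
    # MODEL_BASENAME = "model.safetensors"  # the .safetensors file listed under Files and versions

    # Option 3: GGML-quantized model via llama.cpp -- what I'm using in this video
    MODEL_ID = "TheBloke/Llama-2-7B-Chat-GGML"
    MODEL_BASENAME = "llama-2-7b-chat.ggmlv3.q4_0.bin"  # the 4-bit variant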
Now let me go over the load_model function, just to show you how these different types of models are selected and used internally. It expects three inputs: the device type (cpu, mps, or cuda for a GPU), the model ID, and the model base name, which defaults to None; that default is what it expects when you're running a full model on a GPU. Let's have a look at this function before running run_localGPT.py.

First, it checks whether the device type is cpu or mps while the model base name is not None; that's the case where you're using a quantized GGML model, and it then loads the model through llama.cpp. If we're running on Apple silicon, we offload some of the layers to the GPU; on a plain CPU we don't. A couple of things you want to set here are the context length and the maximum number of tokens. The original Llama 1 models had a context window of 2048, but the newer Llama 2 models have 4096, so I'm going to change that, and you can set the maximum number of tokens to the same value; note that this maximum covers your prompt as well as the response. You can also update the device here if you want, but I'm going to run it on MPS. This is the one change you want to make if you're running the second version of the Llama models.

The second branch is when the model base name is provided but the device type is not cpu or mps; that means you're using the GPTQ quantization format, and in that case we use the AutoGPTQ library. You might want to change some of the settings here, for example if you have multiple GPUs you can use two of them rather than just the first one, and so on. Finally, if the model base name is not provided, there are two options: if you're running on a GPU, it assumes you're running the full model and uses AutoModelForCausalLM; if you're running on the CPU, it uses LlamaForCausalLM for the full model. After that we get the corresponding configuration for the pretrained model, create a pipeline, and return the local LLM wrapped around that pipeline.

Just to reiterate, if you don't care about the technical details, all you need to do is set those two things, the model ID and the model base name. In this case the model ID is the Llama 2 7B chat GGML model, but you can use the 13B or even the 70B quantized models.
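To make that branching concrete, here is a condensed sketch of what a load_model along these lines can look like. This is my paraphrase built from langchain, transformers, huggingface_hub, and auto-gptq calls rather than the repo's exact code, and values such as the number of offloaded GPU layers are placeholders:

    import torch
    from huggingface_hub import hf_hub_download
    from langchain.llms import HuggingFacePipeline, LlamaCpp
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              LlamaForCausalLM, LlamaTokenizer, pipeline)

    def load_model(device_type="cuda", model_id="TheBloke/Llama-2-7B-Chat-GGML",
                   model_basename=None):
        if model_basename is not None and device_type in ("cpu", "mps"):
            # Quantized GGML model served through llama.cpp.
            model_path = hf_hub_download(repo_id=model_id, filename=model_basename)
            kwargs = {"model_path": model_path, "n_ctx": 4096, "max_tokens": 4096}
            if device_type == "mps":
                kwargs["n_gpu_layers"] = 1000  # offload layers on Apple silicon (placeholder value)
            return LlamaCpp(**kwargs)
        if model_basename is not None:
            # Quantized GPTQ model -- needs an NVIDIA GPU and the auto-gptq package.
            from auto_gptq import AutoGPTQForCausalLM
            model = AutoGPTQForCausalLM.from_quantized(
                model_id,
                model_basename=model_basename.replace(".safetensors", ""),
                use_safetensors=True,
                device="cuda:0")
            tokenizer = AutoTokenizer.from_pretrained(model_id)
        elif device_type == "cuda":
            # Full (unquantized) Hugging Face model on the GPU.
            model = AutoModelForCausalLM.from_pretrained(
                model_id, torch_dtype=torch.float16, device_map="auto")
            tokenizer = AutoTokenizer.from_pretrained(model_id)
        else:
            # Full model on the CPU.
            model = LlamaForCausalLM.from_pretrained(model_id)
            tokenizer = LlamaTokenizer.from_pretrained(model_id)
        # Wrap the model in a text-generation pipeline and hand it to LangChain.
        pipe = pipeline("text-generation", model=model, tokenizer=tokenizer,
                        max_length=4096)
        return HuggingFacePipeline(pipeline=pipe)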
Now let's see how you can actually run this. I'm going to clear everything, and to run localGPT we type python run_localGPT.py --device_type mps, since we want to run it on MPS, and hit enter. In this case it will first go and download the model automatically, and once that's done it loads the model using llama.cpp. A couple of things to notice here: first, where the data is stored, so you have the database path; then the model being used, and you'll notice it's the Llama 2 7B chat model; and then which device the model is running on, which in our case is MPS. There are also some other parameters associated with llama.cpp. Again, we're using constitution.pdf as our source document, and after that you can start chatting with your document.

Here's a simple prompt: what is the term limit of the US president? Hit enter and we get the answer: the term limit for the US president is two terms, or eight years in office, and this limit is set forth in Article 2, Section 1 of the U.S. Constitution. You can also ask the model to give you the sources it's using. To get out of a chat session, simply type exit, and let's bring it back up again. In order to enable the sources, we need to pass an extra flag, which is --show_sources (I believe that's the one; let me just confirm — yes, that's the flag). Type that, hit enter, and it loads the model again. Asking the same, or a similarly worded, question, the answer it came up with this time is a maximum of two terms, and now it also shows the different sections, or chunks, it used to retrieve that information.

I want to highlight a couple of things here. Sometimes the responses you get are not accurate, and there are a few reasons for that which you can experiment with. One of them is the chunking process you're using; you really need to pay attention to, for example, the recursive text splitter and how you are splitting your documents, because that is very important. The second is that in this case we're simply passing in a vanilla prompt; instead, you want to use a prompt template that specifically asks the model to look for answers in the context that is provided. I will be covering that in a separate video, where I'll show you how to provide a custom prompt template, because most of the models we're using, especially the open-source ones, each have a different prompt template. A dedicated video on how to integrate different prompt templates based on the model you choose is coming, and that will definitely improve the quality of your information retrieval.
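Just to give a flavor of what that looks like, a retrieval-style prompt for the Llama 2 chat models might look roughly like this; the [INST] / <<SYS>> markers are Llama 2's chat format, while the PromptTemplate wiring below is a sketch of one way to plug it in, not the project's current code:

    from langchain.prompts import PromptTemplate

    # System instructions wrapped in Llama 2's chat markers; the retrieved
    # chunks go into {context} and the user's question into {question}.
    llama2_template = """[INST] <<SYS>>
    Use the following context to answer the question at the end.
    If the answer is not in the context, say you don't know; do not make one up.
    <</SYS>>

    {context}

    Question: {question} [/INST]"""

    prompt = PromptTemplate(input_variables=["context", "question"],
                            template=llama2_template)
    # e.g. pass it to the chain via chain_type_kwargs={"prompt": prompt}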
Another thing I'm really excited about is the localGPT API as well as the GUI; these were contributions from the community. In an upcoming video we'll go into a detailed tutorial on how to use the localGPT API and how to create a GUI on top of it. This is going to be very powerful, because it lets you host your localGPT models anywhere, in the cloud or on your own hardware, and then make API calls to the model, so that's something I'm really excited about and will be covering soon, so watch out for those videos. I also want to mention that a major change is coming to the localGPT project: there are a few pull requests, specifically a comprehensive refactoring and enhancement of the code base, which will really change the way the entire project is set up, and I think that will make it a lot more modular as well as more powerful. You'll be seeing a lot more videos on the localGPT project in the coming days, weeks, and months.

If you would like to contribute to the project, simply look at the current issues people are having and create pull requests based on those, and I would love to merge them; all contributions are welcome, and if you like the project, make sure you star it on GitHub. If you're already using localGPT for your own projects, or thinking about it, I would love to chat, so don't hesitate to reach out; I want to see what people are building on top of this. I've already had some really amazing conversations and seen some great applications the community is building on it. If you run into any issues, you can create an issue on GitHub, and we have a very active Discord server where people share their problems, findings, and what they're building on top of the localGPT project, so consider joining us there. I hope this video was helpful and you found it useful; if you did, consider liking the video and subscribing to the channel. If you're looking for a trusted advisor on your projects, you can reach out to me; details are in the description of the video. Thanks for watching, and see you in the next one.
Info
Channel: Prompt Engineering
Views: 63,199
Keywords: prompt engineering, Prompt Engineer, natural language processing, GPT-4, chatgpt for pdf files, ChatGPT for PDF, langchain openai, langchain in python, embeddings stable diffusion, Text Embeddings, langchain demo, langchain, langchain javascript, gpt-3, openai, vectorstorage, chroma, train gpt on your data, train openai model, embeddings, localGPT, Llama-2, localGPT and Llama2, llama-2
Id: lbFmceo4D5E
Length: 23min 14sec (1394 seconds)
Published: Sat Jul 22 2023