LocalGPT Updates - Tips & Tricks

Video Statistics and Information

Captions
Over the last week or so I have made some really exciting updates to the localGPT project, including support for the new GGUF model format and a much better prompt template that restricts answers to the given documents. I'm excited to share these updates, as well as some tips and tricks for using localGPT. If you're not familiar with it, localGPT is my project that lets you chat with your documents using the power of open-source large language models; everything runs on device and is 100% secure. In this video I'll walk you through the updates I have been working on over the last few weeks (the whole documentation has been rewritten from scratch), and I'll also address some of the issues people have been running into when using localGPT.

Let's start with installation, and specifically how to enable GPU support for llama-cpp-python. To set things up, you first clone the repo, create a conda virtual environment, activate it, and then install all the dependencies except llama-cpp-python using pip install -r requirements.txt. llama-cpp-python is the main package we use to load GGML and the new GGUF format models. It supports both CPU and GPU, but to enable GPU support you need to install it in a specific way. Let me first show you how to install it on a MacBook with an M2 chip, and then how to do the same if you have an NVIDIA GPU.

If you have Apple silicon, you simply copy the corresponding line from the README. Here in VS Code I have already activated a new virtual environment called localGPT, so I paste that command; it sets the required environment variables and, when I hit enter, installs llama-cpp-python. One thing you will probably notice is that I'm pinning a specific version of llama-cpp-python rather than the latest. This version added support for the GGUF model format, but llama-cpp-python has unfortunately dropped support for the GGML format, so if you still want to run GGML models you will need to install llama-cpp-python version 0.1.76 or earlier. For my testing I'm using version 0.1.83, because with the latest version I wasn't able to enable hardware acceleration (that might be an issue specific to my setup), and I generally like to stick to slightly older versions because they tend to be more stable.

If you are running this on a Linux machine with an NVIDIA GPU, you copy the corresponding line instead: in that case the cuBLAS flag is set to on, followed by the rest of the installation instructions. In most cases this should just work, but if you run into issues you may want to set the environment variables manually. On Linux you type export and set the CMAKE_ARGS variable, do the same for the second environment variable, and then run the pip install command to install llama-cpp-python. On Windows the command is a bit different: instead of export you use setx. You again need to set both environment variables, so that FORCE_CMAKE is set to 1 as well, and after that you simply run pip install for llama-cpp-python. You can pin the version number if you want (otherwise it installs the latest) and pass the extra no-cache-dir parameter. That is how you install llama-cpp-python with GPU support on the different platforms.
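To make this concrete, here is roughly what the full installation sequence looks like. Treat it as a sketch rather than the exact README contents: the repo URL, environment name, and Python version are my assumptions, and you should copy the precise CMAKE_ARGS lines for your platform from the documentation.

```bash
# Clone the repo and set up a fresh conda environment (names are illustrative)
git clone https://github.com/PromtEngineer/localGPT.git
cd localGPT
conda create -n localGPT python=3.10 -y
conda activate localGPT

# Install every dependency except llama-cpp-python
pip install -r requirements.txt

# Apple silicon (M1/M2): build llama-cpp-python with Metal acceleration
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.83 --no-cache-dir

# Linux with an NVIDIA GPU: build with cuBLAS acceleration
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.83 --no-cache-dir

# Windows (NVIDIA GPU): set both variables with setx, open a new shell, then install
setx CMAKE_ARGS "-DLLAMA_CUBLAS=on"
setx FORCE_CMAKE "1"
pip install llama-cpp-python==0.1.83 --no-cache-dir
```

If you need to run GGML models instead of GGUF, pin llama-cpp-python to 0.1.76 or earlier in the same commands.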
Now, before looking at the changes I have implemented, let's address one of the questions I see asked a lot, both on GitHub and on the Discord server: the hardware requirements for running localGPT. One of the community members on the Discord server created a table based on their experimentation, and it's extremely helpful for figuring out what kind of hardware you need to run different models. In localGPT there are two components: the LLM and the embedding model. For the LLM, if you're looking at running a 7 billion parameter model in a 4-bit quantized version (specifically the GPTQ format), you need around four to five gigabytes of VRAM. Since we use Instructor embeddings for the information retrieval part, you will also need somewhere between two and seven gigabytes of VRAM depending on which Instructor embedding you choose; the default embedding model is around 4 gigabytes. So for a 7 billion parameter model you are looking at roughly eight to nine gigabytes of VRAM in total. This table is part of the GitHub repo, and I recommend looking at it before deciding which model to run. I also recommend checking out the Discord server, because you will find a lot of helpful information there when it comes to large language models.

Now let's look at some of the new features and code changes. One of the main features we have added is support for GGUF format models, because that is the model format llama-cpp-python supports out of the box. That means you can now run almost all the models available on Hugging Face, except Falcon and MPT based models; support for those is coming soon. If you look at the files in the repo, you will see a couple of new ones, for example a load_models file and a prompt_template_utils file; we will look at those in a little bit.

To ingest a file, you use the same old command, python ingest.py. If you don't have an NVIDIA GPU, you can pass an extra device type parameter: since I'm running this on an M2, I pass mps, and if you want to run on CPU you simply pass cpu. In this case I'm ingesting the original Orca paper. It gets split into chunks (195 in total) and a vector store is created; when you run it for the first time you will see a new folder called DB appear. That is the original document ingestion functionality.

To ask questions about your documents, you run the main localGPT file. Again you can pass the device type (I'm passing mps here), and there are a couple of other optional parameters as well. If you want to see the sources, you can pass the show sources flag, which will show you all the chunks that were retrieved during the information retrieval process. You can also set an extra use history flag, which enables the chat history in your retrieval process. This behavior is disabled by default, because these open-source large language models have a very limited context window, so you have to enable it explicitly with that flag.
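For reference, the two commands described above look roughly like this; the flag names reflect my reading of the CLI, so run python ingest.py --help and python run_localGPT.py --help to confirm the options on your version.

```bash
# Ingest documents and build the vector store (creates the DB folder on first run)
python ingest.py --device_type mps          # use cuda for NVIDIA GPUs, or cpu

# Chat with your documents; the last two flags are optional
python run_localGPT.py --device_type mps --show_sources --use_history
```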
When I run this, you will notice a few things. First, the display of source documents is enabled, so when we ask a question it will also show us the corresponding chunks, and history is set to true, so we can use the chat history as part of the conversation. You will also notice that it is using the Llama-2 7B Chat GGUF model, which uses the latest model format that llama-cpp-python supports. As before, you are presented with the "Enter a query" prompt where you can ask a question. To check whether llama-cpp-python is actually using your GPU, pay close attention to the BLAS parameter in the startup output: if BLAS is set to 1, it is using the GPU; if it is set to 0, it is not.

The document I'm interacting with is the Orca paper, "Progressive Learning from Complex Explanation Traces of GPT-4", so let's ask some questions relevant to it. My question is: what are the different types of prompts discussed in this paper? Looking at the performance, it gets pretty good tokens per second on my M2 with 96 gigabytes of RAM. As for the answer, it says that based on the provided context there are several types of prompts discussed in the paper: WizardLM prompts, Awesome ChatGPT prompts, re-prompting, and I think it also mentions scaling instruction-finetuned language models. That is the answer the LLM generated. Keep in mind that the quality of the LLM's output really depends on the sources or chunks that are retrieved, and the chunks depend on how you split your documents; if you want to learn more about the splitting process, I have a dedicated video on that topic and I'll put a link to it.

When you build chatbots on top of your documents, you want the answers to be restricted to the information provided in those documents, and that is one of the big issues I have seen people face. So let's see whether localGPT can actually restrict its answers to the documents we provided by asking a completely irrelevant question: who is the CEO of Twitter? Here is the answer we got: based on the provided context, the CEO of Twitter is not mentioned, therefore it cannot provide an answer to this question. As you can see, the answers are restricted to the context you have provided, but keep in mind we are using just a 7 billion parameter model, so it can still hallucinate.

So how were we able to restrict the answers? That comes down to the system prompt we provide to localGPT. To understand it, we need to look at the new file prompt_template_utils.py. Here is the system prompt I'm using; I have experimented with quite a few different system prompts for Llama 2 based models, and this one seems to work pretty well for information retrieval from your documents. There is now a dedicated function for this: if the prompt template type is "llama", it adopts the Llama 2 prompt template; if it's anything other than "llama", it simply defines a prompt template that uses the system prompt, then the provided history and context (if history is enabled), and the corresponding question. For Llama 2 it formats the same pieces using Llama 2's specific template. Where exactly is this used? For that we need to go to the run_localGPT.py file. You will notice the code is much cleaner now: there is a main retrieval QA pipeline function, and in there we call get_prompt_template, passing the prompt template type (by default it's set to "llama") and whether we want to use history or not. Essentially this function gives you the ability to define your own prompt templates based on the model you are using. If there is interest, I'll make another video where I go through the updated code base in detail.
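To illustrate the idea, here is a minimal sketch of what such a function could look like. This is not the exact code from prompt_template_utils.py: the system prompt wording, the function signature, and the template strings are illustrative, and it assumes LangChain's PromptTemplate class, which localGPT builds on.

```python
from langchain.prompts import PromptTemplate

# Illustrative system prompt: force the model to answer only from the given context.
SYSTEM_PROMPT = (
    "You are a helpful assistant. Use only the provided context to answer the "
    "question. If the answer is not contained in the context, say you don't know; "
    "do not use any outside knowledge."
)

def get_prompt_template(prompt_template_type: str = "llama", history: bool = False) -> PromptTemplate:
    """Return a PromptTemplate wired for the chosen model family (sketch)."""
    if prompt_template_type == "llama":
        # Llama-2 chat format: [INST] <<SYS>> ... <</SYS>> ... [/INST]
        b_inst, e_inst = "[INST]", "[/INST]"
        system = "<<SYS>>\n" + SYSTEM_PROMPT + "\n<</SYS>>\n\n"
        if history:
            body = system + "Chat history: {history}\n\nContext: {context}\n\nUser: {question}"
        else:
            body = system + "Context: {context}\n\nUser: {question}"
        template = b_inst + body + e_inst
    else:
        # Generic fallback: plain sections instead of Llama-2 special tokens.
        if history:
            template = SYSTEM_PROMPT + "\n\nChat history: {history}\n\nContext: {context}\n\nQuestion: {question}\nAnswer:"
        else:
            template = SYSTEM_PROMPT + "\n\nContext: {context}\n\nQuestion: {question}\nAnswer:"

    input_vars = ["history", "context", "question"] if history else ["context", "question"]
    return PromptTemplate(input_variables=input_vars, template=template)
```

The key point is simply how the Llama 2 [INST]/<<SYS>> wrapping differs from the generic fallback template.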
Another question I have seen a lot in the discussions is: where are these LLMs stored after downloading? You can now define a custom path. If you look at the relevant code segment in the run_localGPT file, I simply create a new folder if it doesn't exist, and the path comes from a models path variable. For that we need to look at constants.py, where this path variable is defined. If you go to the repo you will see it now has a new folder called models, which keeps all the models you download in a single place. Essentially you just define your path there and models will be downloaded to that directory. Currently this works with the Hugging Face model types GGML and GGUF, but I'm also looking at adding support for GPTQ models.

Now let's talk about a couple of other important parameters that have been moved to constants.py. First is the context window: based on the model you have selected, you define your context window there. You can also define the maximum number of new tokens to generate; by default I have set this to the context window size, but you probably want to use a fraction of the context window, so I might change this to a quarter or half of it. Either way, this is now the central place where you define your context window. Another thing you want to define is the number of layers to offload to the GPU. The Llama-2 70B model has a total of 73 layers that can be offloaded, and I have set this value to 100, so depending on your hardware that will determine how many layers it can actually offload. When it comes to embedding models, you have a number of options: by default it uses the Instructor-large model, which takes around 1.5 gigabytes of VRAM, and there are other options listed with their corresponding VRAM requirements. If you need multilingual embedding models, we have also provided a few examples you can explore. For the LLM itself you need to provide both the model ID and the model basename; if you are using an unquantized model, for example the original Hugging Face models, you set the basename to None. By default we use the Llama-2 7B chat model in GGUF format, in its 4-bit quantized version. The constants.py file also contains the same table listing VRAM requirements for different models and quantization levels.
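As an illustration, the settings described above might be laid out like this in a constants.py-style module. The variable names and values are a sketch of the idea rather than the repo's exact contents, and the model repo and file name are examples of what a 4-bit Llama-2 7B chat GGUF entry could look like.

```python
import os

# Keep every downloaded model in one place (created at runtime if missing).
MODELS_PATH = os.path.join(os.path.dirname(os.path.realpath(__file__)), "models")

# Context window of the selected model, and a cap on generated tokens.
CONTEXT_WINDOW_SIZE = 4096
MAX_NEW_TOKENS = CONTEXT_WINDOW_SIZE // 4   # a fraction of the window is usually enough

# How many transformer layers llama.cpp may offload to the GPU;
# set it high and the backend offloads as many as the hardware allows.
N_GPU_LAYERS = 100

# Embedding model used by the retriever (Instructor-large by default).
EMBEDDING_MODEL_NAME = "hkunlp/instructor-large"

# LLM: Hugging Face repo ID plus the specific quantized file (basename).
# For an unquantized Hugging Face model, set MODEL_BASENAME = None.
MODEL_ID = "TheBloke/Llama-2-7b-Chat-GGUF"
MODEL_BASENAME = "llama-2-7b-chat.Q4_K_M.gguf"
```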
If you go to the load_models file, there are three functions: one loads quantized GGUF/GGML models, the second loads quantized GPTQ models, and the last one loads full, unquantized models. Basically, I have removed all of that functionality from the main run_localGPT file and put it here to make the code more modular and easier to understand. Another change is that the retrieval QA pipeline has been moved into its own function: you simply provide the type of device you are running on (a CUDA NVIDIA GPU, CPU, or MPS), whether you want to use chat history, and whether you want to use the default prompt template that comes with the Llama 2 models. It then loads the embedding model and the vector store you created, gets the corresponding prompt template and memory, loads the model, and runs everything through the retrieval QA chain to give you a response.

So in this video we looked at the different features that were newly added to the localGPT project. If you want a more detailed code walkthrough, let me know in the comment section below and I'll create another video. If you find the localGPT project interesting and useful, consider going to GitHub and giving it a star; that would be really helpful. Also, if you want to support my work, check out my Patreon, or you can buy me a coffee. There are a lot more exciting things coming to the localGPT project, so if you want to stay up to date, consider subscribing to the channel and hitting the bell notification button so you don't miss any updates. Thanks for watching, and as always, see you in the next one.
Info
Channel: Prompt Engineering
Views: 18,562
Keywords: prompt engineering, Prompt Engineer, natural language processing, GPT-4, chatgpt for pdf files, ChatGPT for PDF, langchain openai, langchain in python, embeddings stable diffusion, Text Embeddings, langchain demo, langchain, langchain javascript, gpt-3, openai, vectorstorage, chroma, train gpt on your data, train openai model, embeddings, localGPT, Llama-2, localGPT and Llama2, llama-2
Id: G_prHSKX9d4
Length: 19min 38sec (1178 seconds)
Published: Sun Sep 17 2023