Ultra Fast LocalGPT inference with Groq API

Video Statistics and Information

Captions
Somebody recently asked me on the Discord server whether it's possible to use the Groq API with LocalGPT, so in this video I'll show you how to do that. If you're not familiar with LocalGPT, it's my own project that lets you chat with your documents locally on your own machine using GPT models, but in this video I'm just going to show you how to integrate the Groq API so that you can get much faster responses when you're chatting with your documents.

A quick side note about LocalGPT: one of the biggest feature requests I have received so far is a nice-looking UI, and here's a quick sneak peek of what is coming to LocalGPT. I think you all will really like what we are cooking.

Let me quickly walk you through the step-by-step process of integrating the Groq API within LocalGPT. You can run LocalGPT on your own machine, and for this example you don't need a powerful GPU because we're going to be using the Groq API. But just in case you're looking for a preconfigured virtual machine, there is a link on the repo: in Category select Creator, in Image select Prompt Engineering, and you can use the code PromptEngineering to get 50% off the price you see here. I'm going to be using this virtual machine that comes preconfigured with LocalGPT, but if you're running this locally, you want to clone the repo first.

Here we have a virtual environment called localGPT, so first activate that virtual environment. Next, you want to pull the latest changes from the LocalGPT repo in case your copy is not up to date; this one already is. After that, you want to install the requirements file, but we're going to make one change to it first: open requirements.txt and remove the pinned version of LangChain, because you need the latest version of LangChain in order to use the Groq API. After doing that, install the requirements; I have already installed everything, which is why it doesn't install any packages for me. We also need to install one extra package, langchain-groq, which enables interacting with the Groq API. I have already installed this as well, so we are all set here.
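Summarizing those setup steps as shell commands (the repo URL and environment name here are my assumptions from the video; adapt them to your machine):

    git clone https://github.com/PromtEngineer/localGPT.git
    cd localGPT
    conda activate localGPT          # or activate your own virtual environment
    git pull                         # pull the latest changes from the repo
    # edit requirements.txt and remove the pinned langchain version, then:
    pip install -r requirements.txt
    pip install langchain-groq       # extra package that enables the Groq API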
Now, in order to run the LLM through the Groq API, we just need to make a couple of changes in LocalGPT. We have a constants.py file where we store all the constants; it also defines which embedding model and which LLM to use. But since we want to interact with the Groq API, we first need to import the required package, ChatGroq from langchain_groq. We're making this change in the run_localGPT.py file, because that is where we are using the LLM.

The second change is in the function called retrieval_qa_pipeline. If you look at the code here, we are actually loading the model that is defined in the constants.py file, so we just need to comment out that line: instead of downloading the model from Hugging Face and then loading it, we are simply going to call the Groq API. After that, the next change is to add one line of code so that the LLM is now ChatGroq; we need to provide our Groq API key, and we also need to provide the model name. By default, LocalGPT is set up to use Llama 2 models, and that's why I'm using Llama 2, but you can also use the Mixtral MoE model. The only change you need to make in that case is to the prompt template type: you provide "mistral" instead of "llama", because Mistral and Llama use different prompt templates. Other than that, we should be good to go. If you don't know where to get the Groq API key, I covered that in a previous video; I'll put a link to it in the video description.
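Concretely, the change described above looks roughly like the sketch below. This is a sketch, not the exact diff: the commented-out load_model call is paraphrased from LocalGPT's code, the model name reflects Groq's Llama 2 offering at the time of the video, and the API key is a placeholder.

    # run_localGPT.py
    from langchain_groq import ChatGroq  # new import; requires the langchain-groq package

    # inside retrieval_qa_pipeline(), comment out the local model loading ...
    # llm = load_model(device_type, model_id=MODEL_ID, model_basename=MODEL_BASENAME)

    # ... and call the Groq API instead:
    llm = ChatGroq(
        groq_api_key="YOUR_GROQ_API_KEY",  # placeholder; don't commit a real key
        model_name="llama2-70b-4096",      # Groq's Llama 2 70B name at the time; check their docs
    )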
Next, to use this, we first need to ingest a file. LocalGPT comes with a copy of the original Orca paper, and you need to copy your own files into the source_documents folder. To ingest, we run the python ingest.py command; this creates chunks from the documents we provided and builds a vector DB. Here it created a vector DB with a total of 193 different chunks of text. You will see a warning suggesting you install the langchain-community package; you can if you want to, but you don't need it right now.

Now we are all set and can start chatting with our documents using the run_localGPT.py file. First, let me show you how long this RAG pipeline actually takes if you use a model from Hugging Face. In this case we're going to download the Llama 2 70B model: we set this model up in the constants.py file and uncommented the model-loading section, because this time we actually do want to download the model. To run LocalGPT we use python run_localGPT.py. This machine has an A6000 GPU with 48 GB of VRAM, and it's currently downloading the model from Hugging Face, which takes some time. Once the model is loaded, we run this prompt: "What is instruction tuning and how was the Orca model trained?" Again, this is running on an A6000 GPU and we are loading the full model, not a quantized version, and you can see that even with this powerful GPU it takes some time. Here's the response we got; it took 10 to 15 seconds. It says, "I'm here to help you understand instruction tuning and how it relates to the Orca model," then gives a pretty good definition of instruction tuning based on the paper, and then covers the training part as well: the authors of the paper explain that the Orca model was trained on a dataset that simulated zero-shot settings with standard prompts. So this is a pretty good summary.

Now let me show you the speed when we use the Groq API. We commented out the model-loading part, and now we are using the Groq LLM. Don't worry about the API key on screen; I'm going to revoke it right after recording this video. Let's run the same Python script. In this case it will still load the embedding model, because we do need to compute the embeddings, but it's not going to use the local LLM anymore, and here you can see the message that it's using the Groq API. Again we use the same prompt: "What is instruction tuning and how was the Orca model trained?" This is real time, so this is actually how long the model took to generate a response. That was pretty quick, considering it has to go through the retrieval step using the embedding model, then make an API call, get a response, and display it. The answer also has two parts: first it talks about instruction tuning itself, and then it talks about the training of the Orca model, so it's pretty accurate. We could still prompt it to provide more detailed responses, but overall the responses were very fast, so hats off to Groq; they're doing a really good job. Currently the Groq API is free to use; there is some pricing information on their website, but we don't know when that will go live.

Another thing I wanted to highlight: while running LocalGPT, you might see some warning messages, and that's because we're using a relatively older version of LangChain. We are in the process of updating the code base to the newer version, so if you're interested in updates related to LocalGPT or advanced RAG, there's a link to the mailing list in the video description.

My goal with LocalGPT was to create a simple framework where you can run a lot of experiments. You can try different models from different sources: models from Hugging Face, a model you're running on Ollama, or even external APIs. It also lets you try different embedding models; there are quite a few new models on the embedding leaderboard, and you can switch those around here as well. So I think it's a flexible framework that will let you experiment very quickly, and once you like an approach, you can actually use it through the LocalGPT API.
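As an illustration of that flexibility, the knobs live in constants.py. A minimal sketch, assuming the variable names from the repo at the time; the values are just examples, not recommendations:

    # constants.py (illustrative values)
    EMBEDDING_MODEL_NAME = "hkunlp/instructor-large"  # swap in any embedding model you want to try
    MODEL_ID = "TheBloke/Llama-2-7b-Chat-GGUF"        # Hugging Face repo id for the LLM
    MODEL_BASENAME = "llama-2-7b-chat.Q4_K_M.gguf"    # weights file, used for quantized models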
Anyway, this was a quick video on the integration of the Groq API within LocalGPT. I hope you found it useful. Thanks for watching, and as always, see you in the next one.
Info
Channel: Prompt Engineering
Views: 6,777
Keywords: prompt engineering, Prompt Engineer, LLMs, AI, artificial Intelligence, Llama, GPT-4, fine-tuning LLMs
Id: ws0xZCBQNoA
Length: 10min 2sec (602 seconds)
Published: Thu Mar 07 2024