Build a Private Chatbot with Local LLM (Falcon 7B) and LangChain

Video Statistics and Information

Captions
In this video, you're going to learn how to build a chatbot using a private large language model, and how to use LangChain's memory to provide a conversational, private chatbot experience. Can you build a chatbot with a private large language model? It turns out you can, and you can run it on a single GPU. Hey everyone, my name is Venelin, and in this video I'm going to show you how to build your own chatbot using a free large language model, in our case Falcon 7B, together with memory from LangChain to make it conversational. First, I'll show you how to load the model in 8-bit so it fits on a single GPU. Then we'll look at how to add stopping criteria to stop the model from rambling on, how to add memory to your LangChain chain, and how to clean the output of leftover strings. Finally, we'll test the chatbot on a couple of prompts. Let's get started.

There is a completely free tutorial available on MLExpert.io, and there you'll also find a link to the Google Colab notebook with the complete source code, along with explanations of what is happening throughout this tutorial. So if you prefer reading text, please go check out the tutorial itself. Also, please consider subscribing to MLExpert Pro to support my work. Thanks!

I have a Google Colab notebook already running, and as you can see, I'm using a T4 GPU, but I've also enabled the High-RAM option from the notebook settings. Unfortunately, this requires Google Colab Pro, but you need it in order to load the Falcon 7B model we're going to use.

The first thing I did is install all the dependencies. I'm installing bitsandbytes, since we're going to load the model in 8-bit mode, along with the latest Transformers version, accelerate, xformers for faster inference, torch 2.0.1, and the current version of LangChain. Next come the imports we need to build the chatbot; you can follow along in the Google Colab notebook provided with the tutorial.

For the model, I first specify the repository name, which is available on the Hugging Face Hub, and I use AutoModelForCausalLM to load the Falcon 7B Instruct model, the model we're going to chat with. It requires trusting some remote code, and I'm loading it in 8-bit, which makes the model use much less GPU memory. I'm also telling it to use the "auto" device map, which in this case places the model on the GPU, and I'm downloading the tokenizer for the model as well. After everything is downloaded and loaded, you can see that the model takes about 15 gigabytes of storage, and that it ends up on the CUDA device. After loading the model, I check the GPU memory, and we're only using about 8.3 or 8.4 gigabytes of VRAM, so loading in 8-bit is quite efficient. I'm not loading this model in 4-bit, because when I tried it, the inference speed with 4-bit loading was much worse than what we get with 8-bit loading. Maybe once bitsandbytes works a bit more on 4-bit inference, we'll get much better inference times and even lower GPU memory usage with 4-bit loading as well.
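Here is a minimal sketch of the setup described above. The package list and the repository name follow the video; the exact versions and the `load_in_8bit` keyword on `from_pretrained` (the transformers/bitsandbytes API current at the time) are assumptions on my part rather than the notebook's literal code.

```python
# Install the dependencies (versions are illustrative, not pinned to the video):
#   pip install -q bitsandbytes transformers accelerate xformers torch langchain

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "tiiuae/falcon-7b-instruct"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,  # Falcon ships custom modelling code on the Hub
    load_in_8bit=True,       # 8-bit weights via bitsandbytes, fits a single T4
    device_map="auto",       # place the model on the available GPU
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Check how much VRAM the 8-bit model occupies (~8.3-8.4 GB on a T4).
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1024 ** 3:.1f} GB")
```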
These are the settings I'm using for generation. I take the generation config from the model and set a temperature, max new tokens, caching, a repetition penalty, and so on. The model config itself is pretty standard, and you can see it includes a quantization config, which we never passed in explicitly other than asking for 8-bit mode.

Now for the first prompt I'm giving the model: "The following is a friendly conversation..." and so on, using a Human/AI format for the conversation. As far as I know, the Falcon 7B and 40B instruct models are not trained with a specific prompting format, so you might have to come up with one on your own, but this one is pretty standard and it's what LangChain uses by default. Let's run the prompt through the tokenizer; besides tokenizing, I put the input IDs on the model's device, in this case the CUDA device. Let's see how long the inference takes. I run the generation inside an inference-mode context, which appears to speed up token generation a bit, and I pass in the generation config. This took about 14 seconds to generate the response. To decode it, we use the tokenizer again, skipping the special tokens. The response contains the whole prompt followed by the answer: "Dwight K Schrute is a fictional character in the American television series The Office." That looks alright, but you can see the model appends this "User" string at the end; we'll look at how to remove that in a bit.
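A sketch of the generation step just described, continuing from the previous snippet (`model` and `tokenizer` are defined there). The specific hyperparameter values and the prompt text are illustrative stand-ins, not the exact ones from the notebook.

```python
import torch

# Generation settings, adjusted on the model's own generation config.
generation_config = model.generation_config
generation_config.temperature = 0.0
generation_config.max_new_tokens = 256
generation_config.use_cache = True
generation_config.repetition_penalty = 1.7
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id

prompt = """
The following is a friendly conversation between a human and an AI.

Current conversation:

Human: Who is Dwight K Schrute?
AI:
""".strip()

# Tokenize and move the input IDs onto the model's device (the GPU).
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# Run generation without gradient bookkeeping; this took ~14 seconds on a T4.
with torch.inference_mode():
    outputs = model.generate(input_ids=input_ids, generation_config=generation_config)

# Decode the full sequence (prompt + answer), skipping special tokens.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```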
The next thing I'm doing is essentially stopping the large language model from rambling. A lot of you ask: "I got my response from the large language model, but then it continued on and tried to imagine the rest of the conversation, or just rambled. How do I stop this?" This is a major research problem and there are various tactics to avoid it, but one of the easier options that works quite effectively is the stopping criteria provided by the Transformers library. You extend the StoppingCriteria class and then pass it into a pipeline, so the pipeline knows when generation should stop. What we do here is pass in some tokens to stop on, a tokenizer, and the device. First, I convert the tokens to their corresponding IDs, then convert those token IDs into a torch tensor, a long tensor in this case, and put that tensor on the device, here the CUDA device. In the overridden call method, I check whether a stop ID matches the last generated token; if it does, I return True because I want the generation to stop, and otherwise I let it continue.

How do we use this? I showed you the Human/AI format we're using for the prompts, so I pass in "Human" and "AI" as the stop tokens and create a stopping criteria list with our stop generation criteria, the tokenizer, and the device. To wire everything together, I create a pipeline from the Transformers library, passing the model and the tokenizer. I want it to return the full text, since the output will be consumed by LangChain, and the task is text generation. You can also see that I pass in the stopping criteria, which is the list containing our own criterion. Finally, I wrap everything in a HuggingFacePipeline, which is provided by LangChain. Let's run this and see the response. It warns that the model is not supported for text generation, but that's all right, since it appears to work just fine; maybe later versions of the Transformers library will remove this warning. Here is the response, and again you can see that "User" string at the end. Since the model isn't rambling in this example, you won't see the criteria preventing it, but when I tried the chatbot itself, I saw cases where the model kept generating the conversation further, so I had to add this criteria, and it might be quite helpful for you as well.
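Below is a sketch of the stopping criteria and pipeline wiring described above, continuing from the earlier snippets. The class and variable names are my own, and the exact keyword arguments passed to the pipeline reflect common transformers/LangChain usage at the time rather than the notebook verbatim.

```python
import torch
from langchain.llms import HuggingFacePipeline
from transformers import StoppingCriteria, StoppingCriteriaList, pipeline


class StopGenerationCriteria(StoppingCriteria):
    """Stop generation as soon as one of the given token sequences is produced."""

    def __init__(self, tokens, tokenizer, device):
        stop_token_ids = [tokenizer.convert_tokens_to_ids(t) for t in tokens]
        self.stop_token_ids = [
            torch.tensor(ids, dtype=torch.long, device=device) for ids in stop_token_ids
        ]

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Compare the most recently generated tokens against each stop sequence.
        for stop_ids in self.stop_token_ids:
            if torch.eq(input_ids[0, -len(stop_ids):], stop_ids).all():
                return True
        return False


# Stop whenever the model starts a new "Human:" or "AI:" turn on its own.
stop_tokens = [["Human", ":"], ["AI", ":"]]
stopping_criteria = StoppingCriteriaList(
    [StopGenerationCriteria(stop_tokens, tokenizer, model.device)]
)

generation_pipeline = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,               # LangChain expects the full text back
    stopping_criteria=stopping_criteria,
    generation_config=generation_config,
)

# Wrap the transformers pipeline so LangChain can use it as an LLM.
llm = HuggingFacePipeline(pipeline=generation_pipeline)
```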
So let's see how we can create our chatbot. The first thing we're going to use is the ConversationChain; at a minimum, it accepts a large language model, and in our case we pass in the pipeline we just wrapped. The default prompt provided by the LangChain library is quite alright, but as you can see, it's rather generic, so we're going to replace it with a custom prompt. I've pretty much turned this into a marketing chatbot, a salesperson that acts as Dwight K Schrute from the TV show The Office. I've given it some personality: he's persuasive, uncensored, direct, and practical, but also helpful, and when Dwight doesn't know the answer, he truthfully says he doesn't know. Otherwise, I've kept the structure very similar to the default. This is the new template we're going to use.

Next, we enable memory for the conversation chain; if you don't, the chain won't remember the conversation you're having with the chatbot, and that's something we want. From the prompt, you can see that "history" is where the memory is injected, and in our case we use ConversationBufferWindowMemory, which keeps only the last k messages of the conversation. Here I'm remembering only the last six. This is done because these large language models have a context length limitation; the Falcon 7B model has a 2048-token limit, so we essentially limit the number of messages in the history. That's not perfect, and there are other memory types provided by the LangChain library, but here we use something simple. I'm also keeping the history as plain strings rather than message objects, since I don't need those objects. For the new chain, we pass in the large language model, the memory, the prompt we created, and I want the chain to be verbose.

Here is the first prompt we run: "Think of a name for automaker that builds family cars with big V8 engines. The name must be a single word and easy to pronounce." Let's see what we get. The answer is "V8 Family Cars", the automaker name provided by Dwight, and I believe it's something Dwight would say. But you can see that the large language model still appends that "User" string, which is not good; we'll take care of that next. Other than that, you can see that the prompt is passed as-is to the chain.

To clean this up, I'm using an output parser that extends the BaseOutputParser provided by the LangChain library. Output parsers in LangChain are normally used to extract structured information from the response, so think of something like JSON or Markdown, or NER-style entity tagging, but in our case we use one just to clean up the response. I'm removing everything such as the "User" string you see right here, as well as the "Human:" and "AI:" prefixes, just in case they're left in the response, and I also return a type for the output parser. To use this output parser, I recreate the memory and the chain, passing in the output parser as the new parameter, so the output gets cleaned up. Then I run exactly the same prompt, but this time I'm not calling the predict method; I'm calling the chain itself as a function. In the response you see the input, the history, and the response, so I'm going to print out just the response: "V8 Family Cars". Note that we no longer have that "User" suffix and the response is stripped, so it looks alright.
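A sketch of the chain, memory, and output-cleaning setup described above, building on the `llm` wrapper from the previous snippet. The prompt text is a shortened stand-in for the Dwight persona prompt from the notebook, the parser's regular expressions are illustrative, and I attach the parser via the `output_parser` field that LLMChain (and thus ConversationChain) exposes; the notebook may wire it slightly differently.

```python
import re

from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferWindowMemory
from langchain.prompts import PromptTemplate
from langchain.schema import BaseOutputParser

# Shortened stand-in for the Dwight Schrute persona prompt used in the video.
template = """
The following is a conversation between a human and an AI salesperson named
Dwight K Schrute. Dwight is persuasive, direct and practical. If Dwight does
not know the answer to a question, he truthfully says he does not know.

Current conversation:
{history}
Human: {input}
AI:
""".strip()

prompt = PromptTemplate(input_variables=["history", "input"], template=template)


class CleanupOutputParser(BaseOutputParser):
    """Strip the trailing 'User' string and any leftover Human:/AI: prefixes."""

    def parse(self, text: str) -> str:
        text = re.sub(r"\nUser", "", text)
        text = re.sub(r"\nHuman:", "", text)
        text = re.sub(r"\nAI:", "", text)
        return text.strip()

    @property
    def _type(self) -> str:
        return "output_parser"


# Keep only the last 6 exchanges to stay within Falcon 7B's 2048-token context.
memory = ConversationBufferWindowMemory(memory_key="history", k=6)

chain = ConversationChain(
    llm=llm,
    memory=memory,
    prompt=prompt,
    output_parser=CleanupOutputParser(),
    verbose=True,
)

result = chain(
    "Think of a name for automaker that builds family cars with big V8 engines. "
    "The name must be a single word and easy to pronounce."
)
print(result["response"])  # the cleaned answer, e.g. "V8 Family Cars"
```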
Next, I ask the chatbot to think of a slogan for the company: "Built with Power." Another thing you can see here is that the history, the conversation so far, is preserved within the chain. Let's see which domain name Dwight suggests when asked to choose a domain name for the company: v8familycars.com, and again the history of the conversation is preserved. Let's try something a bit more interesting: write a tweet that introduces the company and the first car. Again, the large language model is working quite alright; we're getting responses in three to five seconds, at least for the small outputs. This is the tweet introducing V8 Family Cars: "We build powerful family cars with big V8 engines. Built with Power. V8 Family Cars." You can see it references the slogan provided in the context, and it also references the name of the brand. The final prompt I give Dwight is to write a short marketing email to sell the first car from the company: a 700-horsepower family sedan with a supercharged V8 and a manual gearbox. Let's see the response, and note that, again, the full history has been preserved, along with the initial prompt. This is how these large language models work: you put the history into the prompt itself; there's no magic behind it. Here is the email. Subject: "Drive in style with V8 Family Cars." And the body: "Are you looking for a powerful family car? Look no further than v8familycars.com. Our team specializes in building family cars with big V8 engines. Check out our website today to find the perfect car for your family and driving style," and so on. That's the email the large language model gives us, and that concludes the marketing exercise we're doing with this Dwight-inspired chatbot.

In this video, we've seen how you can use a large language model that is free and private, inside a Google Colab notebook, together with LangChain, to build a chatbot with history. We saw how to stop the model from rambling on with the stopping criteria and how to clean the output with an output parser. We've seen that you can actually ask the bot for something useful, and we tried it on a couple of prompts to get a feel for the conversation quality. If you're using ChatGPT or GPT-4, you'll probably get much better responses, but for a free large language model running on a single GPU, we get quite good performance from this type of model. Thanks for watching, guys! Please like, share, and subscribe, and please join the Discord that I'll link down in the description. I'll see you in the next one. Bye!
Info
Channel: Venelin Valkov
Views: 12,937
Keywords: Machine Learning, Artificial Intelligence, Data Science, Deep Learning, langchain
Id: N7dGOUwufBM
Length: 19min 29sec (1169 seconds)
Published: Sun Jul 16 2023