Customer Support Chatbot using Custom Knowledge Base with LangChain and Private LLM

Captions
In this video, you are going to learn how to create a chatbot that answers questions based on a custom knowledge base. You're going to do that using an open large language model, an open embedding model, and LangChain, and you're also going to learn how to stream the responses, all using just a single GPU.

Hey everyone, my name is Venelin, and in this video we are going to learn how to build a customer support chatbot using a custom knowledge base. First, we are going to create our own dataset based on Skyscanner's help center FAQ. Then we are going to learn how to load an open large language model right within a Hugging Face pipeline provided by LangChain. Then we are going to embed our documents using Chroma DB and an open embedding model. Finally, we're going to wrap everything within a chain with memory in order to start chatting with our chatbot. We're going to demo the whole pipeline, and you will see whether or not the model actually performs well compared to, for example, something like ChatGPT or GPT-4. Let's get started.

There is a text tutorial available for MLExpert Pro subscribers, where you can already find the complete source code along with explanations of why and how you can create your data and set up the model itself. So if you want to support my work, please consider subscribing to MLExpert Pro.

To create the knowledge base for our chatbot, I've used the Skyscanner help center. There you can go through a lot of the questions they have for their customers, split into different categories. What I did was take about 12 questions; if you expand a question, you'll see its answer, so I took each question together with its answer, and I'm going to show you how I've prepared the text files. Of course, feel free to use your own resources as custom knowledge for your chatbot, and let me know how it went in the comments down below.

I have a Google Colab notebook that is already running, and here you can see that I'm using a T4 GPU. If I go through the runtime type, you'll see that it is just a GPU runtime with a T4 GPU, no additional RAM or anything like that, so you should be able to run this in the free tier of Google Colab. The first thing I do here is install all of the required libraries that we are going to use, and then these are the imports; I'm going to run those.

Next, I'm going to take the questions from the Skyscanner help page. I create a directory called skyscanner, and within this directory I create the files. Here I have a function called write_file, to which I pass the question, the answer, and the file path. The format is a text string that starts with "Q:" followed by the question, and then "A:" followed by the answer, and this string is written to the file path provided as an argument. I start by creating the directory, and after that I write out all of the questions we have. I pretty much went and picked the questions and answers on my own, so you might write some different files and experiment with how it works with different content. Let's look at the directory structure after the files are complete: we have the 12 question files, and if you open up a single text file, you can see that the format is kept, with the question and the answer right in there. This is pretty much how you create your data.
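Here is a minimal sketch of that dataset-creation step. The exact question and answer strings come from the Skyscanner help center, so the ones below are placeholders:

```python
from pathlib import Path

questions_dir = Path("skyscanner")
questions_dir.mkdir(exist_ok=True, parents=True)

def write_file(question: str, answer: str, file_path: Path) -> None:
    # Each document is a plain-text file in the "Q: ... A: ..." format
    text = f"Q: {question}\nA: {answer}"
    file_path.write_text(text)

# Placeholder content -- the real files use the Skyscanner FAQ text
write_file(
    question="How do I search for flights on Skyscanner?",
    answer="Enter your departure and destination airports, pick your dates...",
    file_path=questions_dir / "how-do-i-search-for-flights.txt",
)
```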
The model we are going to use is called Nous Hermes 13B. According to its authors, at least, it is a state-of-the-art model trained on over 300,000 instructions, and it is provided to us by Nous Research. It is essentially a fine-tune of the LLaMA 13B model, which should give you a clue that this is another model that is not approved for commercial use, so keep that in mind when choosing your model. The authors provide a lot of information on Hugging Face, including the prompt format you might want to use; we're going to use that same format right here. They also show some benchmark results, but the real reason I chose this model in particular was the Chatbot Arena leaderboard, provided by LMSYS, the Large Model Systems Organization. Here you can see that Nous Hermes is, again, not for commercial use, but it is still a 13-billion-parameter model that ranks very high on this chart, and a quantized version of it is available through AutoGPTQ. At least in my evaluations, this model performed very well compared to some of the other models; I also tried WizardLM, Vicuna, and MPT-7B, and none of those performed as well as Nous Hermes. You might want to try it for yourself, and probably some of the other models as well.

So let's continue with loading the model. First, I get the current device, which in our case is going to be the GPU, or CUDA, device, and then I load the model itself. Since I've already downloaded it, this skips the download and just loads the model from the cache. You can see that I'm passing in the complete path to the model; I want it to use the safetensors format, to trust remote code, and to be loaded on the device, which is again the CUDA device. I'm using AutoGPTQ, which is essentially a library that allows us to load quantized versions of these models. I'm also getting the generation config, which we're going to use for the actual generation. This is pretty much all you need in order to load your model.

After this is complete, I create our sample question: "Which programming language is more suitable for a beginner: Python or JavaScript?" We're going to ask our language model that, so I convert it into the prompt format we saw in the Hugging Face repository, with the question as the instruction. Let's run the inference with this model. First, I pass the prompt to the tokenizer, asking it to return PyTorch tensors, and I move the input IDs returned from the tokenizer to the CUDA device, so the encoding lives on the GPU. Then, within inference mode, I run the generate method, which accepts the inputs from the tokenizer along with some parameters: I want the temperature to be 0.7 (I'll need to change this in a bit), and I want at most 512 new tokens. You can see that this is done in about six seconds, though that will be very dependent on the GPU in your machine, of course.
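A sketch of the loading and inference code, assuming the quantized checkpoint is TheBloke's GPTQ upload of Nous Hermes 13B (the exact repository name, quantization parameters, and prompt wording in the video may differ):

```python
import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer, GenerationConfig

DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
MODEL_NAME = "TheBloke/Nous-Hermes-13B-GPTQ"  # assumed repository name

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    MODEL_NAME,
    use_safetensors=True,   # load the .safetensors checkpoint
    trust_remote_code=True,
    device=DEVICE,
    # Depending on the auto-gptq version and repo layout, you may also
    # need to pass model_basename explicitly here.
)
generation_config = GenerationConfig.from_pretrained(MODEL_NAME)

# The Nous Hermes prompt format from the Hugging Face model card
prompt = """
### Instruction: Which programming language is more suitable for a beginner: Python or JavaScript?

### Response:
""".strip()

encoding = tokenizer(prompt, return_tensors="pt").to(DEVICE)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config,
        do_sample=True,
        temperature=0.7,
        max_new_tokens=512,
    )
print(tokenizer.decode(outputs[0]))
```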
Let's see what the response contains. You can see that the response actually includes the original prompt, and then comes the actual answer: "Python is generally considered more suitable for beginners due to its readability and simplicity compared to JavaScript." You can also see that we have some special tokens that start and end the sequence. I would pretty much agree with that answer; yeah, Python seems like the better language for beginners. Let me know your opinion down in the comments below; I'm sure some of you might think otherwise. Let's also look at the generation config: it is pretty simple, just the beginning-of-sequence, end-of-sequence, and padding token IDs, so pretty standard.

In order to use this large language model, we are going to create a Hugging Face text-generation pipeline and wrap it within a LangChain HuggingFacePipeline, so we'll be able to use it right within a chain in LangChain. But first, I create this text streamer, which outputs the responses from the pipeline to standard output; in our case, it prints the responses into the notebook. We set some special parameters: we want it to skip the prompt, which essentially removes the echoed prompt, and to skip the special tokens that start and end the sequence. I also don't want to use multiprocessing, just in case; I've sometimes had trouble with the streamer when this wasn't set properly. Then I create a text-generation pipeline, again from Hugging Face Transformers, passing in the model, the tokenizer, the max length (which is the max length of the model, 2048 tokens), a temperature of zero (I want this to be highly reproducible for you), the streamer itself, and a batch size of one. This creates the pipeline, and it warns that LlamaGPTQForCausalLM is not supported for text generation; this is just a warning, and everything appears to work just fine. Probably in later versions of the Transformers library this will work without the warning. Finally, I use this pipeline to create a HuggingFacePipeline in LangChain, and you can see that when I prompt the model, the streamer outputs the result. This is pretty much how you set up the model.
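Roughly, the streamer and the wrapped pipeline look like this (a sketch, reusing the model and tokenizer from above; the exact keyword arguments in the video may differ slightly):

```python
from langchain.llms import HuggingFacePipeline
from transformers import TextStreamer, pipeline

# Print tokens to stdout as they are generated, dropping the echoed
# prompt and the special start/end-of-sequence tokens
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=2048,   # the model's context window
    temperature=0,     # deterministic, reproducible output
    streamer=streamer,
    batch_size=1,
)

llm = HuggingFacePipeline(pipeline=text_pipeline)
llm("Which programming language is more suitable for a beginner: Python or JavaScript?")
```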
For the embeddings, again, let me show you the leaderboard I used: it's called the Massive Text Embedding Benchmark (MTEB) leaderboard, and it covers a lot of datasets, languages, etc. One of the best embedding models on it was e5-base-v2, so that's the one we're going to use. This embedding model works quite well; it is a base variant, yet it compares very well against the large and XL versions of the Instructor embeddings, which are other popular embedding models. Let me know if you have better embeddings for this use case and how they work for you. I'm picking these embeddings mainly because their retrieval average is very high compared to the other models. This type of embedding was introduced in the paper "Text Embeddings by Weakly-Supervised Contrastive Pre-training", and there is a link to the arXiv paper. I haven't dived deep into it, but again, the model appears to work quite well. Another nice thing about these embeddings is that you can use them right within the sentence-transformers library. So I load the embeddings and pass in the device. This downloads the model, which is about 1.1 gigabytes, and puts it on the GPU. It works fine even though we already have the pipeline loaded on the GPU: if I check the GPU at this point, we have about 10 gigabytes of GPU memory used just from loading the models, leaving about five gigabytes for inference.

Next, I point at the skyscanner directory, take all of the text files there, and load them into documents. You can see that we get 12 documents, but I'm going to chunk them, since some of the answers are very long and we have the limitation of 2048 tokens. So I split the texts using the character text splitter. Let's look at an example text, "Why does the price sometimes change?": you can see the page content within the document, and the source of the document is also available right within the document itself. Next, I embed the texts within a Chroma database, which creates an in-memory embedding store. If you run a similarity search for, say, "flight search", you'll see that the first question returned is "How do I search for flights on Skyscanner?", which is a very good match; when we do similarity search for the chain, this is going to help us a lot.
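Here's a sketch of that embedding-and-indexing step, using the 2023-era LangChain import paths from the video; the chunk size is my assumption:

```python
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

# e5-base-v2 is loaded through sentence-transformers onto the GPU
embeddings = HuggingFaceEmbeddings(
    model_name="intfloat/e5-base-v2",
    model_kwargs={"device": DEVICE},
)

loader = DirectoryLoader("./skyscanner/", glob="**/*.txt", loader_cls=TextLoader)
documents = loader.load()
print(len(documents))  # 12

# Chunk the long answers so they fit in the model's 2048-token context
text_splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# In-memory vector store over the chunks
db = Chroma.from_documents(texts, embeddings)
db.similarity_search("flight search")[0]
```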
Speaking of the chain, here is the conversational chain we are going to start with, and I'll show you why this chain might not be the best fit for this chatbot and what I'm using instead. Let's have a look at the template: "You're a travel support agent that is talking to a customer. Use only the chat history and the following information," then I pass in the context, which is going to be the documents returned from the embedding search, and then we request that the chat support agent be helpful, say that it doesn't know if it doesn't know the answer, and keep its replies short, compassionate, and informative. Then I pass in the chat history, the new question as input, and finally ask for the response. I load this template and create a prompt template using the variables context, question, and chat history, passing in the template itself.

First, I create a conversation buffer memory, which represents the chat history and stores the questions and responses. The prefixes for the input and the response are exactly the same format we use in the template for the human and the AI, the output key is going to be "answer", and the memory key is "chat_history", which is exactly this variable right here; I also want it to return the messages so everything is formatted correctly. To create the chain from all of this, I take the large language model, which is again the HuggingFacePipeline, and make this a "stuff" chain, so we are not doing any summarization or anything clever for now. Then I use the Chroma DB as a retriever, so it handles retrieving the documents using similarity search. I also pass in the memory and the prompt template, and I want it to return the source documents. So let's call this chain and have a look at the first question and the response; I'm also passing in the verbose flag, so you see all of the data that is passed to the model. You can see that the context is provided from the similarity search, and this is the response: Skyscanner helps you find the best options for flights on a specific date, or on any day in a given month or even year; our search algorithm scans hundreds of sites, and for tips on how to best search, visit our search tips page. It looks quite alright, and the response is quite good. Let's see what the answer contains: we have the source documents, the chat history, the question, etc., and if you go through the source documents, you see the actual documents that were referenced when the model created the answer from the context, which is something very good.

But something interesting happens when you ask a second question, "I booked a flight ticket, but I can't find any confirmation. Where is it?" When you run this, the first prompt given to the chain is: "Given the following conversation and a follow-up question, rephrase the follow-up question to be a standalone question, in its original language." So, for some reason, this chain first rephrases the question you pass in and only then continues with the response, with the human and assistant turns provided as history. This is something that I don't want to happen, and I didn't find an easy way to turn it off. Instead, I continued with the QA chain, but with a buffer memory added to it. In this case, I pass the memory in when loading the QA chain; this is pretty much exactly the same thing, but I'm essentially restarting the memory. Let's see what I'm doing here, again in verbose mode: first I create the question, then I manually find the relevant documents, and I pass in those input documents along with the question. You can see that the response is essentially the same, but when I ask the second question, the input and the response are just added as history, and nothing does any rephrasing of the kind we really don't want. And this actually returns a good response; let's have a look: "Please contact the airline or travel agent you booked with, as Skyscanner does not have access to bookings made with airlines or travel agents." So, again, the response is quite good, and it uses exactly the format I want, since it no longer rephrases our questions.
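A sketch of the QA chain with buffer memory; the prompt wording, the prefixes, and the memory keys are my reconstruction of what the video shows, not the exact source:

```python
from langchain.chains.question_answering import load_qa_chain
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate

template = """
### Instruction: You're a travel support agent that is talking to a customer.

Use only the chat history and the following information
{context}
to answer in a helpful manner. If you don't know the answer, say that you
don't know. Keep your replies short, compassionate and informative.

{chat_history}

### Input: {question}

### Response:
""".strip()

prompt = PromptTemplate(
    input_variables=["context", "question", "chat_history"],
    template=template,
)

# Prefixes match the human/AI markers used in the template above
memory = ConversationBufferMemory(
    memory_key="chat_history",
    human_prefix="### Input",
    ai_prefix="### Response",
    input_key="question",
    output_key="output_text",
)

chain = load_qa_chain(
    llm, chain_type="stuff", prompt=prompt, memory=memory, verbose=True
)

# Retrieve the relevant documents manually, then run the chain over them
question = "How does flight search work?"
docs = db.similarity_search(question)
answer = chain.run(input_documents=docs, question=question)
```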
The final piece that glues together everything we've built so far is the Chatbot class, which I'm going to show you how to write on your own. Here it is: I have the original template right here, and I'm creating a class called Chatbot, which takes a pipeline, embeddings, a directory that we are going to use for the data, a prompt template, and a flag for whether or not I want it to be verbose. What it does is create the prompt template from the template string, and then it does two things: it creates a chain, and it creates a database from the documents we pass in. Let's have a look at the chain creation: it is pretty much the same as what we had with the QA chain; I'm creating a buffer memory right here and passing it into the chain. For the data embedding, I'm doing essentially the same thing we did before: loading the documents, splitting them, and then using Chroma to create the embedding database. The final piece of this class is the built-in, let's say magical, method in Python called __call__, which lets you call an object created from this class as if it were a function, a bit of syntactic sugar if you will. In it, I just take the user input, use the database to do a similarity search, and then run the chain over the documents returned from the search and the user's question; exactly the same thing we did before. Next, I create the chatbot instance, which again creates the chain and the database.

To use this chatbot, we enter a loop. There are some warnings associated with the pipeline, saying that it is being used sequentially, so if you don't want to see those, I'm also including this filter that ignores user warnings. The loop we're going to run is pretty simple: I ask for input, and if the input contains "bye" or "goodbye", I break out of it; otherwise, I pass the input to the chatbot, get the result, and print a new line when it's done. To test this, I start with the questions we've used so far. "How does flight search work?" is the first question I ask, and you can see the response streamed right here; it works quite well, actually. After this is done, it asks for another question, so I ask about the flight ticket with the missing confirmation, exactly the same question as before, and the answer is pretty much the same as we got so far. Then I continue with another question: "I entered the wrong email address during my flight booking. What should I do?" It answers that Skyscanner is unable to help with changing the email and that you need to reach out to the airline or travel agent you booked with. So yeah, it is quite helpful and works quite well. Alright, let's finish up and say "bye" right here, and you can see that the process is now complete. This is pretty much how you use your chatbot.
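A minimal sketch of the Chatbot class and the chat loop, reusing the pieces above; the method and parameter names are my reconstruction, not the exact source:

```python
import warnings

class Chatbot:
    def __init__(self, text_pipeline, embeddings, documents_dir,
                 prompt_template=template, verbose=False):
        prompt = PromptTemplate(
            input_variables=["context", "question", "chat_history"],
            template=prompt_template,
        )
        self.chain = self._create_chain(text_pipeline, prompt, verbose)
        self.db = self._embed_data(documents_dir, embeddings)

    def _create_chain(self, text_pipeline, prompt, verbose):
        # Fresh buffer memory, so each Chatbot starts with an empty history
        memory = ConversationBufferMemory(
            memory_key="chat_history",
            human_prefix="### Input",
            ai_prefix="### Response",
            input_key="question",
            output_key="output_text",
        )
        return load_qa_chain(
            HuggingFacePipeline(pipeline=text_pipeline),
            chain_type="stuff",
            prompt=prompt,
            memory=memory,
            verbose=verbose,
        )

    def _embed_data(self, documents_dir, embeddings):
        # Load, split, and index the documents, as before
        loader = DirectoryLoader(documents_dir, glob="**/*.txt", loader_cls=TextLoader)
        splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=0)
        texts = splitter.split_documents(loader.load())
        return Chroma.from_documents(texts, embeddings)

    def __call__(self, user_input: str) -> str:
        # Retrieve relevant documents, then answer with chat history in the prompt
        docs = self.db.similarity_search(user_input)
        return self.chain.run(input_documents=docs, question=user_input)

chatbot = Chatbot(text_pipeline, embeddings, "./skyscanner/")

warnings.filterwarnings("ignore", category=UserWarning)
while True:
    user_input = input("You: ")
    if "bye" in user_input.lower() or "goodbye" in user_input.lower():
        break
    chatbot(user_input)  # the streamer prints the reply as it is generated
    print()
```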
So, this is it for this video. We've seen how we can create a custom knowledge base and use LangChain with an open large language model and open embeddings to create a customer support chatbot. We've seen how to create a pipeline from a Hugging Face model, then we embedded our documents using the text files we prepared, and finally we wrapped everything within a nice class called Chatbot, which is pretty easy to interact with; you can essentially pass your own text right into this chatbot. We've also seen how you can stream the responses, so your users don't have to wait for the full reply and can read each chunk or token as soon as the large language model produces it. Let me know down in the comments below if you want to see this type of chatbot actually deployed behind an API, so you can use it as a real chatbot on your production servers; I'll probably have to look into that and create a video for it as well. Thanks for watching, guys! Please like, share, and subscribe. Also, please join the Discord that I'm going to share in the description down below, and I'll see you in the next one. Bye!
Info
Channel: Venelin Valkov
Views: 8,895
Keywords: Machine Learning, Artificial Intelligence, Data Science, Deep Learning
Id: iGZ0cV-SRLI
Length: 26min 25sec (1585 seconds)
Published: Tue Jul 11 2023