Create Your Own ChatGPT with PDF Data in 5 Minutes (LangChain Tutorial)

Video Statistics and Information

Captions
In this video I'm going to show you the fastest and easiest way to create a custom-knowledge ChatGPT using LangChain, trained on your own data from your own PDFs. I've seen a lot of tutorials that overcomplicate this a little, so I thought I'd make a fast, to-the-point version where you can copy and paste my code and start building these custom-knowledge tools for your business and personal use as quickly as possible. If you're familiar with applications like ChatPDF, where you can drag and drop in a document and start chatting over it, what we're building today is essentially the same thing: you'll be able to drop in your own PDFs and use them for any purpose you like. The best part about what I'm about to show you is that this method gives you complete flexibility and customization over how your app works and how the documents are processed. Just quickly, I'd like to plug my AI newsletter, which launched recently; if you want all of the hottest and latest AI news distilled down to a quick five-minute read and delivered to your email, be sure to head down below and sign up. Firstly, we'll go through a very brief explainer on how these systems work and the different parts involved, so you can understand what we're building here and how it all works; secondly, we'll jump straight into the notebook I've created for this video, which you can copy over to your own projects and just change the name of the PDF.

Okay guys, here's a quick visualization of how this is actually working under the hood. This is the system we're creating using LangChain: it takes in our documents, chunks them, embeds them, puts them in a vector database, and then allows users to query it and get answers back. I'll take us through it step by step. The first step is to take a document and split it into smaller pieces. This is done because when we query the database to get an answer based on the document, we want to receive a handful of smaller chunks that are relevant to the user's query, not the entire document. So step one is to chunk it: we'll be chunking to 512 tokens or less, splitting the document into however many chunks are needed to get below 512 tokens per piece. Then we take the chunks and embed each one of them, one by one, using OpenAI's text-embedding-ada-002 model, which is by far one of the best embedding models available right now. We then take all of these embeddings, one per chunk, and put them into a vector database so that they're ready for recall when the user queries. The final step is to allow users to actually query the database. We do this by taking the user's query, putting it through the exact same embedding model, and querying the database with the embedding of that query. We get back a number of documents that are most similar to what the user is asking about, then pass them to a large language model as context: we take the user's query and the matched documents, combine them, and ask the language model, "can you answer this question given this context?" Then we send the answer back to the user. That's a very quick high-level overview of how these applications work, and now we can jump straight into building it. At the top of the notebook there's a summary of all the different steps we're going to go over, so you can take a look at that, and then we can get into the installs and imports. I've simplified it all down so you can just run these cells as you go through.
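The four-step flow just described (chunk, embed, store, query) can be illustrated with a small, dependency-free sketch. Everything here is a stand-in for the real system: `embed` is a toy term-frequency vector rather than the ada-002 model, the `chunks` list is hypothetical example text, and a plain Python list plays the role of the vector database.

```python
import re
from collections import Counter
from math import sqrt

def embed(text):
    # Stand-in for a real embedding model (the video uses OpenAI's
    # text-embedding-ada-002): a simple term-frequency vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Steps 1-3: chunk the document, embed each chunk, store the pairs
# (a plain list stands in for the vector database).
chunks = [
    "The Transformer architecture was introduced by researchers at Google.",
    "Self-attention lets the model weigh every token against every other token.",
    "The paper replaced recurrence entirely with attention mechanisms.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Step 4: embed the user's query and return the most similar chunks.
def query(question, k=2):
    q = embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

docs = query("who introduced the Transformer?")
```

In the real system, the matched chunks plus the original query are then sent to the language model as context, which is what the LangChain chain later in the notebook does.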
So, first you need to run this cell here, which installs all of the packages. My API key is already set up; you'll need to replace it with your own API key. Once those are installed, you're ready to get started. For the purposes of my chatbot in this video, I'm using "Attention Is All You Need", the Transformers research paper that was done by Google, which I thought would be interesting to use in the chatbot. You can see I'm referencing it here as "attention is all you need.pdf". If you're using a different document when you clone this notebook, go to the left side panel, drag in and upload your document, then come back and change the name here: replace it with the name of your PDF and you're ready to go. The first main step is loading the PDF and chunking the data with LangChain. There are two different methods I wanted to show you. One is the very easy and straightforward version LangChain offers: a simple page loader using PyPDFLoader, which takes the PDF you give it, chops it into pages, and returns those pages as documents ready to use in your system. This method is great for a quick test, but I thought I'd show you a more advanced method, which splits your documents into roughly similar-sized chunks. A number of different factors go into creating a customized chatbot system like this, and chunk size is one of them; it can determine a lot about the quality of the output. The script we have here lets you split by chunk and set the size of the chunks yourself; I've got it at 512 at the moment, with an overlap of 24.
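The idea of fixed-size chunks with a small overlap can be sketched in a few lines. This is a simplified illustration, not the notebook's actual script: it counts whitespace-separated words as a stand-in for real tokenizer tokens, and the `split_into_chunks` name and its parameters are hypothetical.

```python
def split_into_chunks(text, chunk_size=512, overlap=24):
    # Stand-in tokenizer: whitespace words instead of GPT-2 tokens.
    tokens = text.split()
    chunks, start = [], 0
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    while start < len(tokens):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        start += step
    return chunks

# With a 1,000-word text, chunk_size=100 and overlap=10, each chunk
# shares its last 10 words with the next chunk's first 10 words.
words = " ".join(str(i) for i in range(1000))
pieces = split_into_chunks(words, chunk_size=100, overlap=10)
```

The overlap is what keeps a sentence that straddles a chunk boundary retrievable from at least one of the two chunks, which is why the notebook's splitter uses one.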
Now, the first step in this advanced chunking method is to use textract, which extracts all of the text out of the PDF and saves it to this doc variable. Second, we save it as a text file and then reopen that text file; this just works around some issues that can frequently come up depending on the documents you use. Then you need to create a function that counts tokens. Here you can see I've used a GPT-2 tokenizer and made this little count_tokens function, which takes in some text in the form of a string and returns the number of tokens. Finally, we create the text splitter, which is the LangChain type called RecursiveCharacterTextSplitter; it takes in a chunk size, which is variable as I mentioned, and the length function we've just created. The final step is creating the chunks by passing the text we opened from our text file into the create_documents function, which creates all of the chunks as LangChain schema Document objects. One quick best practice I want to show you guys is to do a quick visualization of the distribution of the chunk sizes, to make sure the chunking process has worked and produced roughly the size we specified. Just run the cell; you don't need to know its specifics, but it shows you the distribution of the chunks. We've got a couple over the limit, which comes down to how the recursive splitter works, but for the most part nothing is thousands and thousands of tokens; they're all roughly within the range we wanted. Then we need to create our vector database
which LangChain again makes super simple with the FAISS package. It takes in the chunks we created along with the embedding model, embeds all of that, stores it in the vector database, and gives us back this db variable. Querying is just as simple: we set up our query, "who created Transformers?", and then all we need to do is run a similarity search on the database using the query, and we get results back. And there we go. If you add this little bit at the bottom, len(docs), you can see that this query is actually pulling back four different chunks that match it, which gives you an idea of how much context is being grabbed from the vector database with each query. Then we take that functionality we've just created and combine it with a LangChain chain, which takes in a query; we do the same thing, "who created Transformers?", retrieve the docs, and run a chain over the query and the docs to get an output. That combines the context retrieved by the similarity search with the query and answers it as you'd expect. So if we run "who created Transformers?", it does the similarity search, brings in the documents, takes the user's query, and then runs one of OpenAI's language models over it all to answer the question, and here we have the answer.

Now, I thought I'd throw in a little extra goodie for you guys, which is converting this functionality into an actual chatbot. I get this a lot on my videos: "yeah, you showed us the functionality, but how can I actually use it as a chatbot?" So here's a quick one I've worked up. If we run this, it uses another LangChain component, the ConversationalRetrievalChain, which takes in a language model and uses the database we created as a retriever. You don't need to know too much about it; just run the cell. Then here is a little chatbot loop that lets us interact with this knowledge base in a chat format. I can ask "who created Transformers?" and there we have it, it answers us; then I follow up with "were they smart?". We have a custom-knowledge chatbot using LangChain that takes in your own PDFs, chunks them up, embeds them, creates a vector store, and then retrieves from it to answer questions based on that information. And it does have chat memory included, as you can see here: "who created Transformers?" gives the names, and the follow-up "were they smart?" is understood in context, so the chat memory is actually working. You have a customized chatbot with chat memory.

That about wraps it up for the video, guys; thank you so much for watching. All of this code is available in the description for you to clone the notebook, swap the PDF out, and start using it for your own purposes. If you've enjoyed this video and want to see more content like this, be sure to head down below and subscribe to the channel; I'm posting tutorials like this all the time, and if you've enjoyed the video, please leave me a like; it would mean the world to me. As always, if this has lit up some light bulbs in your head and you want to have a chat with me as a consultant, you can book a call with me in the description and in the pinned comment, so if you want a feasibility report or to talk through an idea with me, you can reach me there. I also have my own AI development company, so if you want to build something like this on a bigger scale for your business or personal use, we can have a chat as a consult and see if we can help you get that built. And finally, in the description and
pinned comment, there are also links to join my AI entrepreneurship Discord and to sign up to my AI newsletter. That's all for the video, guys; thank you so much for watching, and I'll see you in the next one.
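The chat-with-memory behavior the video demonstrates can be approximated without LangChain or an API key. This is a hypothetical sketch of what a ConversationalRetrievalChain-style setup assembles before each model call: the `retrieve` and `build_prompt` helpers and the toy `chunks` are all invented for illustration, with word overlap standing in for embedding similarity against a FAISS index.

```python
def retrieve(query, chunks, k=1):
    # Naive retrieval: rank chunks by word overlap with the query
    # (a stand-in for embedding similarity against a vector database).
    q = set(query.lower().split())
    return sorted(chunks, key=lambda c: len(q & set(c.lower().split())),
                  reverse=True)[:k]

def build_prompt(query, chat_history, chunks):
    # Combine retrieved context with prior turns so a follow-up like
    # "were they smart?" can be resolved against earlier questions.
    context = "\n".join(retrieve(query, chunks))
    history = "\n".join(f"User: {q}\nBot: {a}" for q, a in chat_history)
    return f"Context:\n{context}\n\nHistory:\n{history}\n\nQuestion: {query}"

chunks = [
    "the transformer was created by researchers at google",
    "attention mechanisms replace recurrence in the model",
]
chat_history = [("who created transformers", "Researchers at Google.")]
prompt = build_prompt("were they smart", chat_history, chunks)
```

In the real chatbot loop, this prompt (context plus history plus the new question) is what gets sent to the OpenAI model, and each answer is appended to the history before the next turn.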
Info
Channel: Liam Ottley
Views: 68,455
Keywords: LLMs, prompt engineering, Prompt Engineer, natural language processing, GPT-4, chatgpt for pdf files, ChatGPT for PDF, langchain openai, langchain in python, embeddings stable diffusion, Text Embeddings, langchain demo, long chain tutorial, langchain, langchain javascript, gpt-3, openai, vectorstorage, chroma, train gpt on documents, train gpt on your data, train openai, train openai model, train openai with own data, langchain tutorial, how to train gpt-3, embeddings, langchain ai
Id: au2WVVGUvc8
Length: 9min 15sec (555 seconds)
Published: Tue May 02 2023