Hello again, we are here with another video from the LLM Zero-to-Hundred series. In this video I'm going to focus on RAG, retrieval augmented generation: I'll explain RAG, and then we are going to design a chatbot called RAG-GPT. This chatbot is going to have multiple functionalities, but RAG is going to be its main side. For the chatbot I'm going to use four different libraries: Gradio, which I'll use for designing the user interface; from OpenAI, the embedding model and GPT-3.5 as our language model; and finally LangChain and Chroma for designing the RAG side of our chatbot.

First I'm going to show you a demo of our chatbot, then quickly go through the GitHub repository and show you how you can execute the code yourself. I will also explain some of the main techniques proposed by LlamaIndex and LangChain for designing your RAG system (three techniques that I found very interesting), and then I will start explaining the project itself and how to develop it step by step using the project schema.

So this is the chatbot that we are going to design in this video. As you can see there are multiple settings here, but as I mentioned, this chatbot is going to have three main functionalities. First, the chatbot is able to connect to a vector database that we prepared in advance; that vector database contains some of our documents (they can be your organization's documents, your company's documents, etc.), and we want to have a Q&A with our documents. The second feature is that we can also upload a document while we are chatting with the chatbot. And the third feature is not technically a RAG feature, but I thought it was a nice addition to this chatbot: we can pass a document, no matter how many pages it has, and the chatbot is able to go through the whole PDF and give us a nice summary of the file.

So let's see how this chatbot works. There are three documents that I've already prepared: one is the paper for the CLIP model, another is the paper for the Vision Transformer, and the third is a lecture from Sam Altman about startups. Let's first ask one of the questions that comes up in the PDF of Sam Altman's lecture. If I simply submit that question, we can see that our chatbot is able to connect to our vector database, retrieve some of the relevant contents related to our question, and finally provide us with a nice answer. On the left we can see the retrieved contents, the ones most relevant to our question, and if I open the PDF I can see it here and scroll through it myself if I need more information.

Let's ask another question, for example: how does the Vision Transformer interpret images? Now we are asking a question about a different PDF file. The Vision Transformer interprets images by applying a pure transformer directly to a sequence of image patches. So again it was able to find the exact answer to our question; we see the contents relevant to the question itself, and our language model was able to pick up the answer and provide us with a precise response.
Now let's change the functionality of our chatbot. What I want to do is upload a PDF right now, as we are chatting with the chatbot, and see how we can have a Q&A with our new document. I've selected two documents: one is the paper "Attention Is All You Need", and the other is a PDF file that contains three different fictional stories, about a bee, a wolf, and a fish. Let's pass the stories to our chatbot. The document is now ready and we can ask our questions. If I ask "who is Fred?" (Fred is one of the fictional characters in our stories), our chatbot again searches over the vector database and finds out that Fred is a small red fish who lived in a cozy coral reef with his loving family. So it again provided us with the answer from the document that we just uploaded.

And finally the last feature. I really like this one, and again, as I said, it is not a RAG technique per se, but it's something I've always wanted for myself: if I pass a PDF to our chatbot, no matter how many pages it has, the chatbot is able to go through the whole PDF and give us a nice summary. That paper has over 10 pages, I think 15, so let's see how our chatbot provides us with a summary of the whole paper. It starts with an introduction: the paper introduces a new network architecture called the Transformer. Then it gives us some of the results, a part about the discussion (the paper discusses the performance of different variations of the Transformer architecture), and it tells us that the results show the Transformer model achieved state-of-the-art performance. Finally it says the paper also includes a list of references to related research papers, etc. So our chatbot was able to successfully go through the whole document and provide us with a nice summary containing the most important pieces of information. This is something I really like, and I think it's going to be a very useful feature for this chatbot.

That's a summary of the chatbot itself and a short demo of how it works. You can also play with the temperature of the language model and see how it affects the performance of the model. However, when we are designing a RAG system, my strong suggestion is to use a temperature of zero, because what we want is precise, accurate answers, and I think a zero temperature is the best choice for that.

This is the GitHub repository. If you go into the LLM-Zero-to-Hundred repository you will see a RAG-GPT folder, and if you go into RAG-GPT you will find the explanation of the project: all the functionalities I just showed you, along with screenshots of the project and the project schema, which is the schema that we are going to use today for developing our project. Then you will see some information that I think is necessary: for example, to execute the project there are two commands you have to run, one that executes serve.py and one that executes the app itself.
I will explain what serve.py is, and the app, while we are developing the project; here I just wanted to give you an idea of the GitHub repository. You can also find the key libraries, along with the versions I'm using, so you can easily execute the code yourself.

First I want to explain what RAG is in general. RAG has three main steps. The first step of a RAG system is preparing your database. This step does not have to be tied to the chat session: it can be done totally offline, and you don't necessarily need to do it while you are chatting with the chatbot. However, as you saw, our system has both capabilities: it can work with a vector database that was already processed, as well as with a vector database that we create on the fly while chatting. Preparing your vector database is the first step in any RAG system, and it includes some key steps of its own: loading your documents, cleaning them if necessary, chunking them (I will show you what chunking means in the next slide), passing the chunks to an embedding model to get the embeddings, and finally creating a vector database. This is the first phase of a RAG system.

Then, once we have the vector database, the question is: given a query, how can we search over our database and find the content most relevant to that query? That is what we call content retrieval. Say we have a query. We pass that query to an embedding model and get the query's embedding, and then we run a search with that embedding over our vector database. No matter which technique we use, we get a score for the query against each chunk, and at the end we retrieve the top-k chunks, the ones with the highest scores, which are the contents most relevant to our query. As I showed you in the chatbot, if I open the left sidebar, you can see that right now I'm retrieving three contents, the ones most relevant to the user's query, which was "who is Fred?" in this case.

Finally we have the synthesis part. This is where the large language model comes into the game. We want to prepare an input for our language model that includes the language model's instruction; all the retrieved contents (how many contents you pass to the model is something you choose based on your needs and your project's requirements); and finally the user's query. These are the three main components a language model needs in order to provide you with a precise answer to your question. We can also introduce a fourth element, chat history, which you may or may not want to pass to the language model, again depending on your needs and your project's requirements; we are going to include chat history in RAG-GPT today. This is the input we will prepare for our language model, and what we get back is the model's response, which is what we just saw in the demo of the chatbot.
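To make the synthesis step concrete, here is a minimal sketch of how such an input could be assembled. The function and field names are illustrative, not the project's exact code; the layout just has to match whatever format you promise the model in its instruction.

```python
# Minimal sketch of the synthesis input: instruction + retrieved content
# + chat history + new question. Names and layout are illustrative.
def build_llm_input(instruction: str, retrieved_chunks: list[str],
                    chat_history: list[tuple[str, str]], user_query: str) -> str:
    history = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in chat_history)
    content = "\n\n".join(f"Retrieved content {i + 1}:\n{c}"
                          for i, c in enumerate(retrieved_chunks))
    return (f"{instruction}\n\n# Chat history:\n{history}\n\n"
            f"# Retrieved content:\n{content}\n\n# New question:\n{user_query}")
```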
In the next step, as I said, I'm going to introduce three well-known techniques in the RAG world, and I'm going to call them by the names the LlamaIndex team uses: they call the conventional RAG technique, which was proposed by LangChain, "basic RAG", and then they propose "sentence retrieval" and "auto-merging retrieval". Let's see what these mean.

First, let's see how the conventional method, basic RAG, works. If I have a document, the first step again is to prepare that document, and we'll focus on that side, because that's where the main difference appears. Say I chunk my document into two chunks. What LangChain offered was: let's have these two chunks, and let's also have an overlap between them, so each chunk not only contains the part of the text it should contain, but also some part of the previous chunk. This is how they try to keep things smooth and provide some more context for the model: in case two chunks that are somehow related are both triggered and we provide both of them to the model, the model can now understand the relation between those chunks. This is the way we are going to design RAG-GPT today. In the LlamaIndex presentation they call it basic RAG, but if you implement this technique you will see that it is very powerful. I've designed a system with GPT-4 and this technique, and I should say I didn't even feel the need to explore other techniques to improve my system, because it was just perfect. If you have nice, clean documents and you implement this technique, you will see that it is a very powerful one.

So what does sentence retrieval mean, and what is the difference between this technique and the previous one? For simplicity, I will now consider our chunks to be single sentences; my point is that each sentence becomes a separate chunk. Say I have a question, and after I search it over my database, the search returns one sentence as the result with the highest score. What sentence retrieval suggests is that when you pass this content to your language model, you don't pass just that specific sentence; you also include some of the sentences that appear before it and some that appear after it. This is how they propose to provide the context to the large language model. The whole concept is that we give the model enough surrounding content that it can extract the most relevant information and provide us with a better answer at the end; see the sketch below.
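Here is a rough sketch of that idea, assuming every sentence was indexed as its own chunk and that we know the position of the best-scoring sentence; `window` controls how many neighbouring sentences are attached on each side before the text is handed to the language model.

```python
# Sentence retrieval, sketched: return the top-scoring sentence plus a
# window of its neighbours instead of the bare sentence alone.
def expand_with_window(sentences: list[str], hit: int, window: int = 2) -> str:
    start = max(0, hit - window)
    stop = min(len(sentences), hit + window + 1)
    return " ".join(sentences[start:stop])

sentences = ["S0.", "S1.", "S2.", "S3.", "S4.", "S5."]
# Suppose the vector search scored sentence 3 highest:
print(expand_with_window(sentences, hit=3))  # -> "S1. S2. S3. S4. S5."
```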
Finally, they propose auto-merging retrieval. This is a technique that I found very interesting, and it is something I would definitely like to explore myself. What they propose is this: say we have a document and we want to chunk it. We again divide it into multiple chunks, but now let's call each chunk a child node; each child node has a connection with some other child nodes, and we consider all of them under a bigger node called the parent node. There are two scenarios that can happen. Say we have a query, we do a search on our vector database, and that search only triggers child node 1, along with some child nodes from other documents. That's okay; it is very similar to the basic RAG system that we just described. But what if we have a query that triggers child node 1, child node 3, and child node 4? Now comes the interesting part. We know there is a parent node that contains these three children, along with a fourth child that is missing from our retrieved content. What they suggest is that, now that you want to pass these to a language model, it is better to also include child node 2, the missing child, because it will most likely add more context and make it much easier for the language model to grasp the meaning behind all those chunks and to provide the user with a more accurate result. So this is auto-merging retrieval. As I said, what I'm going to implement today is the conventional RAG, the one that was offered initially by LangChain, and as we saw, that by itself is a very powerful method. I just wanted to explain these two other methods so you know them and can think about which one might be your best choice, depending again on your requirements and your project's needs.

This is the project schema of RAG-GPT; this is what we are going to design today. I'm going to divide it into multiple steps so it is easier for us to go through the project, and also easier to implement if we see it in different pieces. The first piece, as I said, is the data preparation, the data ingestion side, and it is totally separate from the chatbot: we are going to design a pipeline that takes our documents, processes them, and creates the vector database for us. Then we can point our chatbot at that vector database and ask our questions of it. However, as I said, I'm also going to include this technique in the chatbot itself, so while we are chatting with the chatbot we have the option to upload a document, create a vector database on the fly, and ask questions about the new document. These two pipelines have everything in common; all the steps are similar, and the only difference is where the documents come from. The first one uses our clean, nice company or organization documents; the second one uses a document that I personally just want to ask questions about on the fly, while I'm chatting with the chatbot.

After we have prepared the vector database, what we need is a chatbot that is able to take our question, search the vector database, retrieve some content, and finally pass everything to the language model. These are the steps we are going to implement for the RAG side, so we can ask questions of our vector database: I ask a question; my question is passed to the embedding model; a vector search is performed on my vector database; some contents are retrieved; and I prepare an input for my language model that includes the model instruction, the retrieved content, my question, and finally a chat history, a memory, so the model can provide a better user experience at the end.

And finally, the fourth piece is the summarization part. The way I designed it is that you pass a PDF to your chatbot, and what happens is that we divide the PDF per page: we will have a list, and the list will contain each page of our PDF file.
Then we pass each page to a GPT model that is instructed to give us a summary of that page while keeping all the key information. We keep all the summaries in another list, which becomes the input of our final GPT model: the second GPT model sees all those summaries, is instructed to check them all, and finally provides us with a final summary, a summary of all summaries, that gives us the best understanding of that PDF and what it does. So this is how I designed this fourth piece.

However, this is not flawless. It doesn't work if you pass a 2,000-page PDF to it; if you want to break this system, it is very easy to break. But my intention was not to design a system that is flawless and can work in any scenario. My main goal was to provide something useful. In my case, the PDFs I'm working with are not above 50 pages; they are mostly papers that I'm going to read, and sometimes I'm just interested in a nice short summary of the whole paper, to see whether I want to go deeper and read each page or just pass on it. But if you decide to make this bigger and use larger PDFs or documents, there are two main barriers you have to overcome. The first one is the context length of the second GPT model: if you have a PDF with 2,000 pages, you will end up with 2,000 page summaries at the end, and they are not going to fit into your second model. Just thinking about it now, what you might need is probably a loop that summarizes those page summaries in chunks, so that eventually you have a final list which is just short enough to pass to the second GPT model; in any case, you have to come up with a different solution. The second issue you will face is the number of API calls you can make per minute: again, if you have 2,000 pages and you want to make 2,000 API calls, you have to check whether you have the permission for that, depending on which language model you are using. These are the two main challenges you will have. As I said, though, I just passed a 15-page PDF and it was able to provide me with an introduction, what the paper accomplished, and some of the discussion and results sections. So this is perfect for me, and it is working right now; if you want to improve it, these are the two things you have to consider.
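For what it's worth, here is a rough sketch of the kind of loop I just described for very large PDFs: summarize the pages in batches, then recursively summarize the batch summaries until the text is short enough for the final model. Here `summarize` stands in for a single call to whatever language model you use, and `max_chars` is a crude stand-in for a proper token count; both are assumptions for illustration.

```python
import time

# Recursive "summary of summaries" for PDFs too large for one pass.
def recursive_summary(texts, summarize, batch_size=20, max_chars=12000):
    summaries = []
    for i in range(0, len(texts), batch_size):
        summaries.append(summarize("\n\n".join(texts[i:i + batch_size])))
        time.sleep(1)  # crude guard against per-minute API rate limits
    combined = "\n\n".join(summaries)
    if len(combined) > max_chars:  # still too long: summarize another level
        return recursive_summary(summaries, summarize, batch_size, max_chars)
    return summarize(combined)     # short enough for the final model
```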
Now I want to start explaining the whole project, but before that, let me first show the project structure. If you open it up, you will see a config folder that has the app config YAML file. In this file I keep all the configs that the project uses. As I said, you will have access to the number of retrieved contents, you will be able to change the chunk size and the chunk overlap, and you will be able to change the temperature of the second and third language models. Remember, we have three GPT models in our system: for the first one I provide a temperature setting in the chatbot itself, so you are able to change it, but for the other two I just put the values in the config file, because eventually what I'm going to use is a temperature of zero for all of them. I also have the system roles, which you can play around with, and some other configs that we will go through while we are developing the project.

Then I have a data folder. In the data folder there is a docs folder that contains three PDFs, as I said earlier: the CLIP paper, the Vision Transformer paper, and finally "How to Start a Startup", a lecture by Sam Altman. I'm going to prepare these three documents in our vector database, so our model has access to that processed vector database and we can have a Q&A with these documents. Then there is a second folder, called docs 2, which contains the two PDFs I'm going to use as my examples: one of them I will upload while I'm chatting with the chatbot, and the other one I will use to get a summary from the chatbot. In the data folder you will also have a vector database folder. This folder is created automatically, and it contains a processed folder and an uploaded folder: the processed folder contains the vector database built from the offline files, and the uploaded folder will contain a vector database built from whatever PDF you give to your model while you are using it. Then I have an images folder, which I will skip; it's just some images for the readme file. Then we have the src folder, which holds all our code: there are the modules you will use for executing the chatbot, plus something I'm only using for the video, which I will remove when I update the GitHub repository, and then we have utils, which contains the code for the back end of our chatbot. As you can see, there is a lot of code involved in this project. My goal is not to sit and write the code from scratch; my goal is to take each piece of the schema, walk you through that part of the project, and explain each piece in detail, so you end up with a very good understanding of how this project was designed and developed, and you can modify and adjust it based on your needs.

The first part I'm going to explain is the data ingestion part, the part where we process our offline documents and prepare a vector database for our system. Let's see how it works. I just removed the vector database in the data folder, so we can recreate it while we execute that part of the code. Let me open up a notebook so I can explain what is happening. What I need for this part (again, remember from the second slide that the first step is data ingestion) is, first, a function that loads the documents for me. Then I need a function that chunks the loaded documents; that's the second function. The third function I need, depending on how you are designing your RAG system, is one that gets the embeddings of those chunked documents. And finally I need a fourth function that creates a vector database and saves it somewhere for me. So this is what we are going to implement for this first part of the project.

Before I move on, I have to explain one more thing: how a vector database works. I've put together a simplified picture of a vector database, so you have a good understanding of what is happening inside it. These texts refer to our chunks.
When we have a chunk, we pass it to the embedding model and get the embedding, and what we store in our vector database is not only the embedding but also the corresponding chunk, right next to it. It means that if I have the embedding of a chunk, I have its text next to it. When I run the search with my query, I search over the embeddings. I explain in a different video why we do this and how vector search can preserve the semantic relationships within a sentence, a paragraph, or a piece of text; if you are not familiar with the concept, it is a fundamental one for designing our RAG system, and I strongly recommend watching that video. It's a short video that just explains the embedding side of the project. So we search over the embeddings, but what we eventually want to retrieve is the text side of our vector database. Why? Because we want to pass it to a language model, and our language models understand text; they don't understand embeddings. What I receive, then, is the text corresponding to the embeddings that got the highest scores based on my search result. So this is just a short description of a vector database.
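As a toy illustration of that idea (with made-up three-dimensional "embeddings", nothing like real embedding sizes), the store keeps each embedding next to its text, scores the query embedding against every stored one, and hands back the text of the best match:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# (embedding, chunk text) pairs; in reality both come from the real pipeline
store = [
    (np.array([0.9, 0.1, 0.0]), "Fred is a small red fish."),
    (np.array([0.1, 0.9, 0.0]), "The Transformer relies on self-attention."),
]
query_emb = np.array([0.8, 0.2, 0.1])  # embedding of the user's question
best = max(store, key=lambda pair: cosine(query_emb, pair[0]))
print(best[1])  # we retrieve the TEXT, because that's what the LLM reads
```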
So these are the four steps I want to implement for creating that vector database. If you go into the utils folder and open the prepare-vector-DB module, you will see a class called PrepareVectorDB, and this class implements exactly these steps. What this class receives is, first, the data directory, which is this folder, so you can simply copy your personal documents into it and use them with the chatbot. It also receives a persist directory, which is the directory where the vector database is going to be created. Then we have the embedding model, the chunk size, and the chunk overlap; if you check the config here, you have access to the chunk size and chunk overlap, and if you are using a different embedding model you can define it here as well. The class then creates the text splitter. The text splitter is what LangChain uses for creating those chunks, and they offer multiple different text splitters: one is called CharacterTextSplitter, another TokenTextSplitter, etc. The one I'm going to use for this project is RecursiveCharacterTextSplitter; it's the one I used, I think it works just fine, and I will stick with it for this chatbot. Then I also have my embedding model.

This first function, as we discussed, works in two different ways. What I want is, first, the blue part: I have some offline documents, I want to create the chunks, and I want to create the vector database. I also want the second part: while I'm chatting with the chatbot, if a user uploads a document, what I receive from Gradio is the full directory of that document. So I simply take the directory of the uploaded documents (it can be one document or more than one), chunk them, and go through exactly the same steps as for the offline documents. The only difference in this function is the way it receives the document directories: if I just provide the directory pointing to our data folder, it creates a vector database we can use immediately when we open the chatbot, and if I upload a document while I am chatting with the chatbot, it takes the directory, creates the chunks, and goes through all the steps needed to create a new vector database for us.

This is the second function, as I explained: chunk_documents. What it does is create the chunked documents from our PDF files. Before I move on, let me show you what that looks like. I have a test notebook here (this one also won't be included in the repository; it's just something I use for the video), and what I want to show you is what the outputs of these functions look like. Both of them give us a list of documents. The loading function gives us a list in which each page has its own entry: if I have a document with, say, two pages, I get a list with one element per page. If I simply print the first element in my list, I get this tuple-like object, which has two components. I have page_content (if I print it, I get the text of that specific page), and I also have metadata, which contains the page number and the source, i.e. the directory of that document. I'm going to use the page content to create the chunks, and I'm going to use the metadata for providing the references on the sidebar, to make it interactive for the user, so they can also access the PDF file. So I get a list of docs (we just saw how it looks) and then I chunk it. The chunks look exactly the same; the only difference is that the list is no longer going to have two entries for that document, it is going to have more, depending on the chunk size and the chunk overlap.

And finally, what I do here is create the vector database. One step that looks missing here is passing our chunks to an embedding model; it is happening behind the scenes, LangChain just made it easy for us. If we pass our embedding model, the directory where we want our vector database to be created, and the chunked documents, it creates and saves the vector database for us automatically. If you decide to design your own RAG system, using a different library or a customized version of a RAG system, that's the step you have to take yourself: pass your chunks to an embedding model, create the embeddings, and finally create a vector database that contains the embeddings along with the corresponding text, and save it somewhere on the cloud or on your local PC. Finally, I save the vector database, and along the way I also print some useful information so we can see what is happening: the number of loaded documents, the number of pages, the number of chunks, etc.

This is the first part of the project, the one you have to execute before running the chatbot: the vector database needs to be created so the chatbot has access to some documents, and you can start chatting right away if you don't want to upload a document or summarize one. To execute this class, you run the module that simply instantiates it, creates the vector database, and saves it. So let's see how it works. If I execute it, our vector database is created; you can see a folder was added here, in the processed subfolder.
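Condensed into a few lines, the four steps the class performs look roughly like this with LangChain's classic API (the module paths moved around in later LangChain releases); the loader choice and the paths are my assumptions for illustration, not necessarily the project's exact ones.

```python
import os
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

data_dir, persist_dir = "data/docs", "data/vectordb/processed/chroma"

docs = []                                   # 1. load: one Document per page
for name in sorted(os.listdir(data_dir)):
    if name.endswith(".pdf"):
        docs.extend(PyPDFLoader(os.path.join(data_dir, name)).load())

splitter = RecursiveCharacterTextSplitter(  # 2. chunk, with overlap
    chunk_size=1500, chunk_overlap=500)
chunks = splitter.split_documents(docs)

vectordb = Chroma.from_documents(           # 3 + 4. embed and save: the
    documents=chunks,                       # embedding calls happen inside
    embedding=OpenAIEmbeddings(),
    persist_directory=persist_dir)
print(f"Pages: {len(docs)} | Chunks: {len(chunks)}")
```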
And here is some of the information that we printed. We just loaded three documents (these are the three PDFs here); the total number of pages was 83, across all the PDFs together. Then we chunked the PDFs using this config here, so our chunk size was 1,500 and the chunk overlap was 500, and the number of chunks was 341. And finally we created and saved our vector database. Now our chatbot is ready to chat with our documents. So I have just explained this part of the project; this other part is exactly the same, and the only difference is that I'm going to upload the documents and Gradio is going to provide me with the directories of the documents that we want to process.

The next step is to create a pipeline that enables us to chat with our vector database. I will go through the pipeline quickly first: I want to have a query; that query needs to be passed to an embedding model; then I want to perform a vector search; from the vector search I want to retrieve some contents; and I want to prepare an input for my language model. In this step I'm going to skip the memory side, just to simplify things, so the input is going to contain the retrieved content, the model instruction, and my query. Then we expect to see the result from the model. For designing this part, before I build it in Gradio, what I want to do is first design it in an interactive way in the terminal: I just want to create that whole pipeline, but execute it from the terminal. So what I need here, let me open a markdown cell: first, before anything, I need access to a large language model, to an LLM, and to an embedding model (if you are using OpenAI, you have to load all your credentials). I need to have access to the directory of the vector database. I need to get the user's query, then get its embedding, then perform a search using the query's embedding on the vector database, and then I will retrieve some content. Again, this is a config you can define here; right now I'm retrieving the top contents with the highest scores based on my search result. Then I'm going to prepare the model input, which is the query plus the retrieved content plus the model instruction.

The instruction of the model is here. This is the instruction I designed for this chatbot, and it is a very important part of your chatbot: you will probably need to massage it a little to get the best performance out of it, depending on your language model and your use case, but right now this works just fine for us. I'm telling the model: you are a chatbot; you will receive a prompt that includes a chat history (we are skipping that for this part), then the content retrieved from the vector database based on the user's question, and the source; your task is to respond to the user's new question using the information from the vector database, without relying on your own knowledge. This is a typical way to design RAG systems: we don't necessarily want GPT models, or whatever language model you are using, to use their own knowledge, because there is a great chance that they hallucinate, and that is something we are avoiding in our systems; we want the most accurate response to our question. Then I tell the model that it is going to receive a prompt with the following format: I provide the model with the schema of the prompt it should expect to receive, and we are going to design our input exactly as it is specified there.
So here, in the terminal Q&A module, what I'm going to do is create exactly what we just wrote down. As I said, I first load my LLM, and I also load the system role and the temperature, which is zero in my case; I pass the vector database directory, and I also load how many contents I want to retrieve. Then, first, I instantiate an instance of the Chroma vector database along with my embedding model, and I design this Q&A step by step, as we wrote it down here. (There is a second function here that I'm going to use as well; I will explain it while I'm building the process.) First I get the user's query: I simply ask the user to type a question in the terminal. That becomes the user's query, and I prepare it for my language model as I explained. Then I perform a vector search, a similarity search, using my question on my database. Again, if you want to design your own customized RAG system, you will probably need to implement a few more steps here, but this is how much things are simplified if you use LangChain; for this I'm using similarity_search.

Then I retrieve some documents, but before I pass the documents to my language model, I clean them up a little. Based on my experience, some documents are not as clean as others, and in my case the documents that I'm retrieving are not in plain text in a way that is readable for a human or for the language model. So what I'm doing here is taking the documents, which is a list, going through all the retrieved contents, and cleaning them. If you check my cleaning function, you can see that I've addressed quite a few issues: these are the special characters that I found in my documents, and these are the things I'm cleaning up here. If your documents are clean, or if you are using the chatbot with more general documents, this is a step you might not need, but it is good practice to print out the input of the model before passing it to the model and to make sure everything is as it should be, i.e. human-readable, so the model can actually understand what you are passing to it. So I clean the documents, then provide the model with the input it expects to receive, and I also print that input, just to see what the model receives; it is exactly as we defined it.

Let's see how it works. If I execute the terminal Q&A module with Python, we will see (I'm also printing the length of the vector database) that it is now asking me to type a question. If I simply ask "explain the architecture of the Vision Transformer", it looks back at my documents, searches them based on my question, retrieves some content (as you can see, the contents are now cleaned), and passes it to the language model. And this is the response we receive: the architecture of the Vision Transformer is a pure transformer applied directly to sequences of image patches; it does not rely on convolutional networks. So it provided us with an explanation of the architecture of the Vision Transformer, and that is exactly what we wanted. We have just designed the pipeline that connects us to our vector database: this purple line is now in place.
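Boiled down, that terminal loop looks something like this. I'm using the current OpenAI SDK here (adjust the call if you are on the older `openai<1.0` interface), and the paths, `k`, and the system role are placeholders rather than the project's exact values.

```python
import re
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment
vectordb = Chroma(persist_directory="data/vectordb/processed/chroma",
                  embedding_function=OpenAIEmbeddings())

system_role = "You are a chatbot. Answer only from the retrieved content."
while True:
    question = input("Ask a question: ")
    docs = vectordb.similarity_search(question, k=3)      # top-k retrieval
    # light cleanup so the model receives readable plain text
    content = "\n\n".join(re.sub(r"\s+", " ", d.page_content) for d in docs)
    prompt = f"# Retrieved content:\n{content}\n\n# New question:\n{question}"
    response = client.chat.completions.create(
        model="gpt-3.5-turbo", temperature=0,
        messages=[{"role": "system", "content": system_role},
                  {"role": "user", "content": prompt}])
    print(response.choices[0].message.content)
```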
The next step is to design a Gradio app, connect that pipeline to Gradio, and be able to see everything in a nice user interface. For that, I first want to show you the part of the Gradio code that is there purely for design purposes. If I open the RAG-GPT app (let me close this one) and you check this part: from here up to this point is where I'm just designing the user interface. Nothing happens in this part of the code except that our user interface is built, which I'm going to show you in a second; everything that happens in the back end happens from here onward. So let's see what it looks like if we simply render the user interface. For that I need to import gradio as gr, and I need the UI settings class to implement the feedback: if you open utils, you will see UISettings here, with two functions, toggle_sidebar and feedback. These two functions help us design the sidebar on the left, for showing the references and giving a link to the PDF, and also the feedback button. Right now I'm not using the feedback button for anything, but you can definitely use it to store the user's feedback and improve your system if you are designing it for your organization. So I'm going to execute this class here as well. Now, if I execute this code and call demo.launch(), what I'm going to see is the interface; I have also assigned avatar images to the user side and to the response side, but I will just skip that for now. So this is our user interface: just a skin, a surface. It is not connected to any back end yet, and now it is ready to be connected to the back end so we can interact with our chatbot.

So let's see how it works. If I open my app: as I said, up to this point is just the skin of the user interface, and this is where all the back end happens. I have two main functions: one of them is the respond function and the other is process_uploaded_files, depending on the functionality of the chatbot. If I go into the respond function, which is the one we want to look at right now: it takes the chatbot argument, which is a list of Q&As from the chat session while you are interacting with the model. This chatbot list also includes your historical Q&As, and it is what we will use to provide the model with the chat history. It also receives the message, the new query from the user; it receives the data type, which says what type of functionality our model needs to have right now (is it going to connect to our processed vector database, do we want to upload a new file, or do we want to ask for a summary?); and it receives a temperature. You will have a temperature setting for the language model we are using right now, though I personally use it at zero. Then I check the data type: if it is equal to "preprocessed doc", the chatbot connects to the vector database we designed with the blue pipeline, and if it is equal to "upload doc: process for RAG", it connects to the vector database we created with the upload pipeline. This is how we switch between the two vector databases, and how we can have our Q&A with different documents.
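Stripped down to its skeleton, the wiring between the UI and the respond function looks roughly like this; the real app adds the dropdown, the temperature slider, the upload button, the sidebar, and the feedback buttons, and the respond body runs the full RAG pipeline instead of a placeholder.

```python
import gradio as gr

def respond(chatbot, message):
    # in the real app: vector search + prompt building + LLM call go here
    answer = "(model response placeholder)"
    chatbot.append((message, answer))    # gradio renders this Q&A history
    return "", chatbot                   # clear the textbox, update the chat

with gr.Blocks() as demo:
    chat = gr.Chatbot()
    msg = gr.Textbox(placeholder="Ask a question")
    msg.submit(respond, inputs=[chat, msg], outputs=[msg, chat])

demo.launch()
```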
Then what I do is, first, instantiate the vector database. Actually, first I check whether the vector database exists; if not, I ask the user to upload a document before running the chatbot. And then I do exactly the same as we did in the terminal module, the one where we created the interaction with the chatbot in the terminal. The only difference is that now Gradio is waiting for me to return the results to it, so it can print them for us. So I get the response from the model and append it to my chatbot list (again, this chatbot list comes from the Gradio side, and it is what Gradio uses to print the user's query along with the model's response), and I also keep the retrieved content, because I want to print it nicely on that left sidebar.

Before I show you the chatbot, there is one more element that I need to explain, so we can then see how our chatbot interacts with our first vector database, and that is serve.py. Since I was developing my project locally, one of the challenges I had was how to actually show you the documents. There are multiple ways to show the user the documents, if you wish to do that. If you are using a different user interface, for example Chainlit, it already provides you with a nice sidebar where you can print your retrieved content and also show a document. In Gradio I couldn't find that feature, so eventually what I came up with was to create my own sidebar: I designed that markdown sidebar so I can print the references, and since I couldn't show a document inside it, what I did was include a link at the end of each reference. That link points to port 8000, which is the port I'm going to use to host all of my documents. This is what this class does: it checks these two folders and serves all the documents on that port, so if the user clicks on any of the references, they can see the document live on the port we just created. You can also change the port in the config file.
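Conceptually, serve.py does something like the following: host the document folders over HTTP on a fixed port so the sidebar links can open the source PDFs. This is a minimal stand-in, not the project's exact code, and the directory is an assumption.

```python
import http.server
import socketserver

PORT = 8000  # in the real project this comes from the config file

class DocsHandler(http.server.SimpleHTTPRequestHandler):
    def __init__(self, *args, **kwargs):
        # serve files relative to the folder that holds the PDFs
        super().__init__(*args, directory="data", **kwargs)

with socketserver.TCPServer(("", PORT), DocsHandler) as httpd:
    print(f"Serving documents at http://localhost:{PORT}")
    httpd.serve_forever()
```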
Let me also do a little bit of cleanup here. If I want to start the chatbot and start chatting with my vector database, first I need to execute serve.py, so now my documents are live and we can use the reference side, and then I need to execute the RAG-GPT app with Python (I had a typo in the command here at first). So we have a user interface, and we just created a function that returns the chatbot history and the retrieved content to my user interface. Let's see how it actually works. I have a few questions that I want to test for each document; let's see if we can pick up the answers we are looking for.

For the first document I want to ask: "How do you deal with burnout while still being productive and remaining productive?" This is a question that I found in the document itself, so let's ask it. And here is the answer to our question: it successfully picked up the answer from the retrieved contents, and it is also showing us the reference for the document we just queried. Actually, if I try to find that question in the PDF, I think here it is: "How do you deal with burnout while still being productive and remaining productive?" So this is the question I just asked; our search result successfully pointed to the right document, and our model was able to pick up the answer from it. The answer is that the way to deal with it is to acknowledge that it sucks and keep going, unlike students, who can accept lower... and yes, "acknowledge that it sucks and you keep going, unlike the students": this is the exact answer we were looking for.

Let's ask another question, about the Vision Transformer. I'm asking the model to explain the Vision Transformer architecture, and this is the explanation of the model we saw earlier in this video, a nice explanation of the Vision Transformer: it is based on the transformer architecture (I think that's actually how the writers put it in the PDF file), which is commonly used in NLP tasks; ViT applies a transformer directly to sequences of image patches, etc. Again, a perfect answer to the user's question. Let's ask a more detailed question, like: how does the Vision Transformer interpret images? The Vision Transformer interprets images by applying self-attention to sequences of image patches, and if you look at the description in the PDF file, you will see that is exactly how they explain it. If I open the sidebar here, there are three big chunks of documents that were retrieved based on my question, but the model was able to pick up the answer from these three chunks and provide us with a precise response. And again, if I view the PDF file, I can see my Vision Transformer PDF here.

Now let's ask a question about the CLIP model. What I want is for the model to explain the architecture of CLIP. "The architecture of the CLIP model is not mentioned in the retrieved content; however, based on the information provided in the chat history, CLIP stands for..." That's interesting, right? The model was not able to pick up the answer from the retrieved content, which is something I half expect, especially when I'm dealing with GPT-3.5, given the amount of information we are passing to the model. Actually, this is probably a good time to show you the input of our language model. If I check the input, I have a chat history that contains the previous two Q&As, I have the three chunks that I'm retrieving for the chatbot, and I'm also giving it my new question. As you can see, this is a huge amount of information. The chatbot is still able to pick up the answer in most cases and most scenarios, but sometimes it gets a little bit confused.
This is an issue I almost never had with GPT-4: especially if your documents are clean and your chunks actually contain the answer you are looking for, GPT-4 is much more powerful than GPT-3.5. However, I just wanted to take this chance to also show you the input of our model; this is exactly the format we told the model we would use to process the input and pass it to it. And now look at what it is saying: CLIP stands for Contrastive Language-Image Pre-training; it is a model that combines natural language processing and computer vision to perform image classification tasks; CLIP uses the Vision Transformer architecture, which applies self-attention to... oh, this is amazing: it actually was able to provide me with the correct answer. It was a little bit confused at first, but eventually it picked up the answer, and if I open up the PDF, we can see that the references are now pointing to the CLIP paper. And this is how we can design a RAG system to talk to our documents.

Let me ask one more, more detailed question: how does the CLIP model process image and text? "The CLIP model processes both image and text by using a combination of NLP and computer vision techniques. It leverages the Vision Transformer, which applies self-attention... this allows the model to interpret images by analyzing their internal representations, such as learned embedding filters and position embeddings." So this is fantastic: we were able to point our chatbot at the vector database we created earlier and start asking questions about the documents we had.

The thing is, as you saw, the choice of language model and embedding model, the LLM instruction, the way you prepare your input, the language model's prompt, how clear your question is: these are all very important factors you have to consider when designing a RAG system, and you have to start playing around with them. Actually, just by designing this project, I understood a lot more about how effective it is to massage the language model's input prompts, and how effective changing the chunk size and the chunk overlap was; all of these configs play a very important role in the performance of the system. So I strongly recommend that you start experimenting: use different GPT models, use different configs for data preparation, use different types of documents, and see how it performs in different domains. These are all things you eventually have to test in order to be able to design a nice, functional chatbot.

So let's jump to the next step. What we did up to this point is create our pipeline and create a vector database; I also showed you how the same pipeline works if we upload a document, and we had a Q&A with the three documents that we processed and kept in our vector database. So the second step is: let's upload a document and see how it performs. If I upload a document right now, let's upload one that is not a technical paper... it says "if you would like to upload a PDF..."; okay, yes, of course, I have to change the chatbot functionality first. Right now I want to process my document for RAG. If I upload the document, it is processed, and a new vector database is created in the data folder.
If I open the vector database folder now, I see that the uploaded folder was added, and it contains a vector database holding the embeddings and the text from my new PDF file. In my PDF file there are three fictional stories with three fictional characters: one of them is a bee, one of them is a wolf, and one of them is a red fish. So let's say: "tell me more about the red fish". "The red fish in the story is named Fred. He is described as not just an ordinary fish but one with scales that sparkle like..." It just picked up the story and started walking through it, giving me everything about Fred: "one day Fred and Delie stumbled upon a map that points to a hidden treasure located in a distant part of the ocean". That's amazing, right? So again it was able to pick up the answer from the document that we just uploaded while we were chatting with the chatbot. And these are the two ways you can create your RAG system.

However, there is now one last piece in our project, which is summarizing a whole document. This is the part I explained earlier; I also explained the two challenges you will face if you want to expand this part, improve it, and use larger documents. But let's see how it works and how I designed it for this project. Let me close this and start from here. If you come to this part, you will see that I'm taking, from the dropdown, the functionality the user is requesting: it is either "preprocessed doc", or "upload doc: process for RAG", or "upload doc: give full summary". This is what we are going to implement here, and this rag_with_dropdown value is passed both to my process_uploaded_files function and to the respond function. If I go into process_uploaded_files, what I see here is that I'm using rag_with_dropdown to decide: I'm going to upload a document, and that uploaded document might be used for RAG processing, or it might be used for summarization. So here is what I am checking, based on the user's selection: if the user selected "process it for RAG", I process it for RAG, and if the user selected "upload the document and give a full summary", I provide a full summary of the document. Again, when the user presses the upload button, what I receive is a directory to that document, but there are a few more things that I pass to my function to get the summary, which I will explain here. This is the summarizer class that we are going to use for that purpose.

But before that, let me explain what is happening. When the user presses the upload button, what I get is the directory of the document the user wants to summarize. Then I load the document into a list, and that list contains each page of the document: it has page 1, page 2, and so on until the last page. Then I pass each page to a GPT model and ask for a summarization: I create a for loop in which I pass each page to a language model and, in return, get a summary of that page. So eventually what I will have is another list, but now, instead of page 1, page 2, and so on until the last page, I have the summary of page 1, the summary of page 2, and so on until the summary of the last page.
Then what I want to do is pass this new list to another GPT model: the new GPT model provides me with a summary of the summaries it received. So I pass the second list to a large language model, and what I get at the end is a summary of all the summaries. This is what this class does. However, I added two layers of, let's say, safety, two things just to make sure my class is not going to throw an error. The first one concerns the first round of summaries from my pages. I know that my GPT model, my language model, has a context limit, and I know I have some number of pages in my PDF file. So what I do is ask the model to respect a budget derived from the context limit that I'm targeting: for example, here I am using 3,000 as the maximum context length that I want to end up with at the end. If I divide that by the number of pages, I can tell the model: summarize each page, but your maximum token budget is the result of this division. This is included in the LLM summary instruction, and you can see it here: I'm telling the first model that you are an expert text summarizer; you will receive a text, and your task is to summarize it and keep all the key information; keep the maximum length of the summary within a range that I define based on this value. You can change this value based on the language model you are using, and this way I make sure that the second model I'm going to use here is never going to receive a context longer than the permitted length. That's one thing I just want to make sure of, and the last thing, just to make it a little safer, is that I'm going to add a threshold. So you have two options to play with here: there is the maximum token limit, and there is a token threshold. Since I know that this maximum token limit is sufficient right now, and it is not going to hit the limits, especially with the documents that I personally use, my token threshold is zero; but that token threshold is subtracted from the result of the division, just to make it a little safer, so you have a limit that is not exactly on the edge and the model is never going to hit it. Still, this is not the main challenge you are dealing with if you are going to pass a large PDF file, so make sure to change the code so it can handle those scenarios.

So let me show you the summarizer. As I said, I get a file directory, which is the file directory of the PDF the user just submitted. I have max_final_token, which is the value here, 3,000: I divide that 3,000 by the length of the document, and I ask the language model that, when it is summarizing each page, it shouldn't surpass that limit. I have the token threshold, which I subtract from that division, just to stay on the safe side. I have the GPT model; the temperature, which is going to be zero; and the LLM system roles. I have a second system role for my second model: remember, I have two language models. The first one is for summarizing each page, and I just showed you how we pass it the maximum token limit for each summarization. And I have a second GPT model that I deliberately don't constrain that way: I don't need to suppress that model and tell it that it has to stay within a limit.
So the second GPT model gets a different system role: "you are an expert text summarizer; you will receive a text and your task is to give a comprehensive summary and keep all the key information." That is how I ask the second model to behave.

Finally, there is the character overlap, which I forgot to mention earlier; call it step 2.5 in the schema. When I pass a page, say page two, to the first GPT model, I also include a little bit of the ending of page one, so the model has a better grasp of the connection with the previous page, especially if a sentence was cut off or not finished properly. Again, I wouldn't call this a best practice that solves every problem: you might prefer LangChain's chunking system, you might design your own summarizer, or you might even feed in the chat history so the model knows what it summarized earlier; these are all techniques you may want to implement. What I do is simple: for page two, I add the ending of page one, so the model sees a little bit of the previous page. That's what the character overlap means, and by character overlap I'm including the last 100 characters. I also add a little bit from the next page, so you can think of it as a two-way overlap: some overlap with the previous page and some with the next.

Concretely, I check where I am in the document: on the first page, I do nothing except add a little bit from the second page; on any page except the first and the last, I add the character overlap from both the previous page and the next page; and on the last page, I add just a little bit from the previous page. This is simply how I designed it to make each input more understandable for the model, so it never sees half sentences without knowing what they were about.

Then I go through all the pages, build these inputs, pass them to my model, and concatenate all the outputs into one long text that I keep for the second model. So what I pass to the second model is the full set of summaries produced by the first model, and from that it creates the final summary. The full summary, by the way, is not a list; you could keep a list if you want to post-process it, but I simply wanted a single string containing all my summaries so I can hand it to the second model and get the final summary. That is everything this summarizer class does; the remaining function you see here simply sends the API call to the GPT model and returns the response. Below is a sketch of the overlap logic and the final summary-of-summaries call; after that, you have a good understanding of how this system works and how I designed it, so let's see it in action.
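Putting the overlap and the final call together, a minimal sketch might look like this; it reuses the pages list, the summarize_page helper, and first_system_role from the sketches above, and while the 100-character window matches the video, everything else is illustrative:

```python
# Sketch of the two-way character overlap and the final summary call.

def page_with_overlap(pages: list[str], i: int, overlap: int = 100) -> str:
    """Attach the tail of the previous page and the head of the next page."""
    text = pages[i]
    if i > 0:                    # every page except the first
        text = pages[i - 1][-overlap:] + text
    if i < len(pages) - 1:       # every page except the last
        text = text + pages[i + 1][:overlap]
    return text

# First stage: summarize every page (with overlap) and join the results
# into one long string rather than a list.
full_summary = "\n".join(
    summarize_page(page_with_overlap(pages, i), first_system_role)
    for i in range(len(pages))
)

# Second stage: one more model call turns the joined page summaries into
# a single comprehensive final summary.
second_system_role = (
    "You are an expert text summarizer. You will receive a text and your "
    "task is to give a comprehensive summary and keep all the key information."
)
final_summary = summarize_page(full_summary, second_system_role)
```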
So again, I will pass two documents. The first is the stories document I mentioned before. There are three stories in it, and what I want to see is a summary that covers all three, because I don't want to miss any information. And here we go: I get the first story, about Amarok, a lone wolf; the second story, about Fred, the small fish; and the third story, about a young bee. The model went through the three pages of the document, realized there are three different stories, decided the user should get a piece of each one, and gave me a very short summary of each story in its response.

Now let's pass the paper "Attention Is All You Need". This is a 15-page PDF, if I remember correctly, and what I want to see is a piece of everything from beginning to end: a nice introduction, the body of the paper with what they accomplished, and eventually a conclusion or some sort of discussion. That's exactly what the model provides. "The paper introduces a new network architecture called the Transformer": it started from the very first part of the paper, the introduction. Then it reports the scores the model achieves, so it was able to pick up the results, and notes that the model "also generalizes well to other tasks such as English constituency parsing". "The paper discusses the architecture of the Transformer model, including stacked self-attention and fully connected layers." "The paper discusses the performance of different variations of the Transformer architecture." "The results show that the Transformer model achieves state-of-the-art performance." And at the very end: "the code used for training and evaluation is available on GitHub; the paper also includes a list of references." So our summarizer went through every page of the paper and produced a short, nice summary of the whole thing, including some of the key results, which is fantastic, and now we have our summarizer in the chatbot.

So we just saw how to design all these pieces and turn them into a chatbot with a Gradio user interface: how to talk to our vector databases and chat with our personal documents, how to upload a document while we are working with the chatbot and chat with it, and eventually how to pass a document and get a summary of the whole thing.

I want to wrap this video up by mentioning some of the important points to consider in your project. As I said, model selection is very important, for both the language model and the embedding model. Language model instruction is super important. So are the LLM configs: temperature, token limits, and all the other settings I showed you in the YAML file. The context length is something you want to keep a close eye on: the LLM instruction, the query, the retrieved content, and the chat history all have to fit inside it. If you are planning to design a system for a large number of users, this really matters, along with the number of API calls you can make per minute; a minimal sketch of a token-budget check follows below.
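One way to enforce that budget, which is my own suggestion rather than something shown in the project, is to count tokens with tiktoken before each request; the 4,096-token budget and the reserve are made-up numbers for the example:

```python
# Sketch: guard the context window before calling the model.
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

def n_tokens(text: str) -> int:
    return len(encoding.encode(text))

def fits_context(instruction: str, query: str, retrieved: str, history: str,
                 budget: int = 4096, reserve: int = 500) -> bool:
    """True if instruction + query + retrieved content + chat history
    still leave at least `reserve` tokens for the model's answer."""
    used = sum(n_tokens(t) for t in (instruction, query, retrieved, history))
    return used + reserve <= budget

# e.g. drop the oldest chat-history turns until fits_context(...) is True
```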
Then there is selecting and designing the RAG technique: which technique do you want to use? I mentioned three different RAG techniques, and we implemented only the basic one, yet we just saw how powerful even the basic technique is; still, if you are designing your own system, you may want to try the different techniques and see which one works best for you. Finally, there are the configs of each technique: chunk size and overlap if you go with basic RAG, child and parent chunk sizes if you go with auto-merging RAG, and the sentence window size if you go with sentence-retrieval RAG. These are the settings to keep in mind and play around with to see which gives you the best performance.

In the last slide, I want to mention some deployment considerations in case you are planning to build a RAG system for your organization. The first is memory and privacy. You will probably want a memory that keeps a record of your model's responses, the users' queries, and the users' feedback; together, this information can help you improve your system. However, the user queries and model responses are likely to contain sensitive content, so you will want to encrypt them and only pull them out if something happens, while the user feedback by itself is very valuable for improving the system over time.

Then you want database and data flow management in place. Say you have 2,000 documents and want to add one more to your vector database: the last thing you want is to recreate the vector database from scratch. What you want instead is a data flow management process that can add just that one document, or remove one specific document, from the vector database (a minimal sketch of such incremental updates appears at the end of this section). This is very important if you want a clean pipeline and good database management.

Then there is testing: you have to run a lot of tests to make sure your RAG system performs as expected and is suitable for your tasks. Then security: data leakage and authentication are two examples in this domain. You want to make sure the documents being processed and added to the vector database have accurate content, and that only the employees with the proper authorization can access them.

The next point is special cases and scenarios. Say you have two types of documents: general documents that are easy to digest, on which a RAG system does very well, and highly technical documents specific to your organization, on which a RAG system might not do as well. You will probably have to come up with different solutions for those special cases, perhaps fine-tuning a language model or some other approach.

You will of course also need a deployment pipeline, for both data ingestion and the chatbot itself, and you have to consider latency, scale, computational power, and server price: everything that belongs to designing the infrastructure of your project. It matters how you deploy your model and where, whether in the cloud or on-premises; these are all things to keep in mind.
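To make the data flow management point concrete, here is a minimal sketch of adding and removing one document's chunks in a persistent Chroma collection without rebuilding anything; the collection name, ids, and metadata are invented for the example, and note that plain chromadb embeds with its default embedding function here, whereas the project uses OpenAI embeddings:

```python
# Sketch: incremental updates to a Chroma vector store instead of
# rebuilding it from scratch whenever a document changes.
import chromadb

client = chromadb.PersistentClient(path="data/vectordb")
collection = client.get_or_create_collection("org_docs")

# Add one new document's chunks.
collection.add(
    ids=["report_2024_chunk_0", "report_2024_chunk_1"],
    documents=["First chunk of the new report...",
               "Second chunk of the new report..."],
    metadatas=[{"source": "report_2024.pdf"},
               {"source": "report_2024.pdf"}],
)

# Remove one specific document's chunks by metadata filter.
collection.delete(where={"source": "old_policy.pdf"})
```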
Monitoring and logging are also very important: you want to keep observing your project as you move on. And feedback I already mentioned, next to memory and privacy. These are some of the key aspects to consider if you want to deploy a chatbot designed to be accessed by a large number of users.

And that was it for this video. I hope you now have a good understanding of how RAG systems work and that you can design your own RAG chatbot. In the next video, I'm going to design a chatbot with Streamlit whose main feature is deciding whether to answer the user from its own knowledge or to search the web and answer based on the search results. We are going to make that happen by leveraging a feature of language models called function calling, and it's going to be a very interesting project. I hope to see you in the next video.