How to chat with your PDFs using local Large Language Models [Ollama RAG]

Captions
In this video I'll show you how to build a local RAG system that doesn't connect to the internet, using Ollama and Python. There are so many reasons to build a local RAG system instead of using something connected to the internet. Maybe you have very sensitive documents, like medical or financial records, where security is very important; you don't want to share detailed personal material on the internet, because you never know which LLM it will end up training. The best approach is to get models you can use offline, with Ollama in this case, and I'll demonstrate how to do that: how you can feed those personal documents to a local model that's not connected to the internet and still chat with them, PDFs especially in this case, without having to upload anything to a public online system that could use your information for training. So let's get started.

All right, so let's get to it. In this project, as you can see on my left side, I have PDF files, and these could be files saved on your system: maybe your resume, some lecture notes, some books, whatever you want to chat over using an LLM. These are documents the LLM wasn't trained on, data it has potentially never seen, and you want to be able to chat with them. The benefit, like I mentioned before, is that these may be private documents you don't want to share with public LLMs like ChatGPT, which can retain that data and use it, and you don't want someone to see your sensitive information. So all of this will happen locally on your own system.

This is the entire pipeline. We start by loading the PDFs using UnstructuredPDFLoader, which comes from LangChain; most of the functions and modules we'll use here are LangChain modules. If you're not familiar with it, LangChain is an AI Python framework that helps you build AI apps by abstracting a lot of the work that needs to be done: loading files, calling LLMs, chunking, and many other things. It makes things really easy, so you can just focus on building the app. Unstructured is actually a company, unstructured.io, whose product lets you load almost any type of file for use by an LLM, which is really cool; LangChain wraps that, and that's what we use here. Once we load the file, it extracts all the content from the PDFs. After that we use another LangChain function that recursively splits the text by characters, so basically it does the chunking for us, and we set a strategy for how many characters we want per chunk and how much overlap. Once the text has been chunked, we put the chunks in a variable, iterate through them, and embed them using nomic-embed-text. That's an embedding model; there are many other embedding models you could use, but I chose nomic because it's fast and it's one of the popular ones.
As a note, everything in this pipeline can be replaced, so you can switch any piece for one you prefer; this isn't the definitive setup, it's just to get you started and show you what you can swap, and there are so many options. After we've created vector embeddings of those chunks, we load them into ChromaDB, and again, you can use Chroma, Weaviate, Milvus, there are so many of them out there, but I'm using Chroma in this case.

Once the embeddings are in Chroma, the next step is to query it and retrieve whatever you want to ask of that data. As a user, you'll put in a query: maybe "What is this document about?" or "Summarize this for me." Here I'm also using the MultiQueryRetriever module from LangChain, and basically what it does is optimize your question. You might ask "Summarize this for me," and I've told the multi-query retriever to generate five more questions based on that user question, worded and stated differently. That way, when it sends those five questions to the vector database to retrieve the closest matching context, it gets results back for each of them. Once those come back, we take the union of all the results, which is probably the best context across all five, and that's what gets sent to the LLM for it to generate an answer for you. It's just a strategy to make your answers better in RAG, and there are so many strategies for this piece; there are things like agents, which have recently been gaining popularity, but I'm using the multi-query retriever here to make things a little better and see if it beats plain retrieval. Once the context has been gathered, it gets passed to the LLM along with the prompt, your question, and the retrieved context. Gemma in the diagram is one model you could run locally, but I swapped it out, and you can swap in whatever local model you have from Ollama. Once that's done, it gives the response back to you, the person on the other end.

So let's see what the code looks like. Let's jump over to it. I have three different sections that we'll go through: first, the code for ingesting the PDF; then the code for creating vector embeddings of that PDF; and then retrieval, which is asking questions over it and getting answers back.

Let's start with ingesting PDFs. First we install unstructured and langchain, and also "unstructured[all-docs]", which basically covers every file type, including PDF and text. Next we import UnstructuredPDFLoader, which we just installed, so we can use it within our code, and we also import OnlinePDFLoader, which gives you the option, if you don't have the files locally, to access a PDF by downloading it from the internet where it's already hosted.
One thing to remember, though: if you don't have download access to that online PDF, you might be able to click it on the website and view it, but when it comes to downloading it, the site won't let you; it will throw a 403 error, which basically means you're not allowed to access it. So keep that in mind; it might be best to just download the file and then use it, and there are also lots of different loaders within LangChain that might handle this better than this one.

I downloaded this file from the World Economic Forum; it's a recent one from January, and I figured these LLMs were trained a while back, so they won't have this information. What better way to demonstrate how you can add context to LLMs using RAG than to give them something they haven't seen and haven't been trained on? So this is the file we'll be going through; it's about 26 pages, so not too big, just good for beginning to understand RAG, how it works, and implementing it.

Cool. The next step is to load the file, and it's pretty easy: if local_path is set, that is, if the path points to something, I create an UnstructuredPDFLoader with it, pass that in as the loader, and load the file. Once that's loaded, let's verify: we preview the first page to see what the data looks like. It says "In collaboration with McKinsey & Company, The Global Cooperation Barometer 2024," and that should be page one; checking the actual PDF, yes, that's exactly what's on there, so it has the content we're looking for.
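Here's a minimal sketch of that ingestion step; the filename is my assumption (the actual path isn't shown in the video), and import paths can differ between LangChain versions:

```python
# Install first:  pip install unstructured langchain "unstructured[all-docs]"
from langchain_community.document_loaders import UnstructuredPDFLoader

local_path = "report.pdf"  # hypothetical path; point this at your own PDF

if local_path:
    loader = UnstructuredPDFLoader(file_path=local_path)
    data = loader.load()
    # Preview the first page's content to verify the load worked
    print(data[0].page_content)
else:
    print("Upload a PDF file")
```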
All right, we're done with ingesting, and we verified that we did ingest and can see some data, which is cool. Now let's go to the next step, vector embeddings. We want to embed this text, basically converting it from human-readable to computer-readable: zeros and ones, or more precisely, vectors. One thing to note: for vector embeddings you also need a vector embedding model, and in this case I chose nomic-embed-text. If you don't have Ollama on your system, the quickest way to get it is to go to ollama.com; you click download, install it on your system, and they also have Windows and Linux versions, which I think are in preview right now. It's really easy to install, and they have instructions on how to do it.

Once you've installed it, the next thing is to pull, or install, models onto your system. There's a list: if you go to the models page, ollama.com/library, you'll see all the models available for you to use locally. There's Gemma, Llama 2, Mistral, Mixtral, Command R, LLaVA, and so many more to play with. We're going to be using nomic-embed-text, and as the library describes it, it's a high-performing open embedding model with a large token context window. That large context window is the key part: it can take in a lot of tokens, which we can then load into our vector database. I've already installed it, so I don't need to pull it again, but you can run the pull command in a notebook, and once that's done, ollama list will show all the models on your local system. If you just pulled nomic-embed-text, ollama list will show it; you can see it's already on my system, along with some other ones we'll use down the road.

After that, the next thing is to install chromadb, our vector database that will store all the vector embeddings, and langchain-text-splitters, which is what we use to chunk our text. Then we import OllamaEmbeddings, RecursiveCharacterTextSplitter, and Chroma.

The first step here is splitting and chunking. We'll split the text with a chunk size of 7,500 characters, so basically it cuts after every 7,500 characters: that's the first chunk, then another 7,500 and another cut, and that's the next one. But there's always a problem: a cut at 7,500 might land in the middle of a word, or between words that belong together, and then at retrieval time the chunks might not connect very well, so you might not get the accurate answers you're looking for; the model might hallucinate or just not grasp the full context. So we introduce a chunk overlap, which carries some text across the boundary between each pair of chunks so there's continuity and coherence, helping avoid the "lost in the middle" problem a little bit. It's not 100% foolproof; these numbers are arbitrary, so you can play with them and adjust to find what works best for you. Another thing: you can fine-tune your embedding model on your data to make it better, though that's more advanced at this point. There are so many strategies to use here, but this is just a basic one to show you how it's done so you can get going.

So next we split the documents: I'm passing in the data we already loaded at the top, our PDF, and the splitter chunks it. Then we add all those chunks to the vector database; first it takes every chunk and converts it to a vector, then it loads the vectors into the database. Basically we're saying: Chroma, from my documents, take the chunks as the documents; for the embedding, use OllamaEmbeddings with the model we want as our embedding model, nomic-embed-text, which we downloaded a while back; and show progress while it's working through the files, so we're not staring at an empty screen for a long time not knowing what's happening. We also want it in a collection called local-rag; you can think of the collection name like a specific table in your database that you're putting the data into.
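A sketch of that splitting-and-embedding step, continuing from the data variable above; the chunk_overlap value here is my assumption, since the video doesn't state the exact number:

```python
# Install first:  pip install chromadb langchain-text-splitters
# Pull the embedding model:  ollama pull nomic-embed-text   (check with: ollama list)
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split the loaded PDF into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=7500,
    chunk_overlap=100,  # assumed value; tune this for your data
)
chunks = text_splitter.split_documents(data)

# Embed each chunk with nomic-embed-text and store the vectors in Chroma
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text", show_progress=True),
    collection_name="local-rag",
)
```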
All right, cool, now we're done with vector embedding. Just to recap what we've done: we loaded our text, then created vector embeddings over the data we loaded from the PDFs, and then loaded those into ChromaDB, our vector database. The next step is retrieval: you ask it a question and get responses back.

There's a structure for how we do this with LangChain. LangChain has ChatPromptTemplate and PromptTemplate, which we'll use in a second; we have StrOutputParser as well; we have ChatOllama, which wraps the local model we want to use; we have RunnablePassthrough, which will carry our question through; and we have MultiQueryRetriever, which is what I talked about for generating the five questions, and we'll get to it in a second.

First, let's instantiate the model we want to use. The local model I'm going to use as my LLM will be Mistral. You can change this as well: put in Llama 2 or whichever one you've downloaded; maybe your system has enough memory, or you have a bigger machine with a GPU that can run some really good models fast. We pass the name to ChatOllama and instantiate that as the model we'll use. I found Mistral to be really good, so feel free to experiment with others too.

Then we create a variable, a query prompt, using PromptTemplate, and this is what I was talking about: the input variable is the question, so it takes your question in and generates five similar questions based on it. Those are the questions that get passed as queries to the vector database, for the vector database to retrieve context with, and at the end we get a union of everything that came back from those five questions, which will probably be the best context for our question. Essentially, we're just passing in the question and giving the model instructions for how to do that. After that we create our retriever: MultiQueryRetriever.from_llm, passing in the vector database we already created above and filled with the text from our PDFs, saying, hey, vector database, we're going to use you as a retriever, with Mistral as the LLM and the query prompt telling it to generate the five questions.
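A sketch of the multi-query retriever setup, assuming the vector_db from above; the prompt wording is paraphrased from the video, not copied verbatim:

```python
from langchain.prompts import PromptTemplate
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_community.chat_models import ChatOllama

local_model = "mistral"  # swap in any model you've pulled with Ollama
llm = ChatOllama(model=local_model)

# Prompt that asks the LLM to rephrase the user's question five different ways
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate
five different versions of the given user question to retrieve relevant documents
from a vector database. Provide these alternative questions separated by newlines.
Original question: {question}""",
)

# Retriever that runs all five variations against Chroma and unions the results
retriever = MultiQueryRetriever.from_llm(
    vector_db.as_retriever(),
    llm,
    prompt=QUERY_PROMPT,
)
```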
The next step is to create a RAG prompt, and the RAG prompt in this case just says: answer the question based only on the following context. The context will be what we get back from the vector database, and the question will be the query we asked in the beginning. That becomes our prompt to the LLM; we build it with from_template and pass it along as our template.

Then we create a chain. LangChain has these chains you can use that make it easy to chain things together: the context is our retriever, which we already set up, the multi-query retriever that runs those five questions and gives us the context back; our question gets passed through RunnablePassthrough; after that comes the prompt we just made, answer the question based only on the following context; then it goes to the LLM, Mistral in our case; and then it outputs the answer back to us. So that's the chain. We run that, and then we call chain.invoke, and this is where we ask the question. I was a little bit extra and wanted to use input(): in Python, if you call input(), it opens an input box with a blinking cursor, and you can type in a question in real time and send it. You don't have to do that; you can take out the input() call and pass the question directly, and I've shown further down how to do that as well.

A quick note on why this has already been run: I actually ran it before recording this video, because on the first take it competed with the recording for system resources and broke the video; the recording disappeared and I wasn't able to save it, which sucks, but I learned a lot there. Running this consumes a lot of resources, and if you have a GPU, all the better for you, but I don't have one, so I had to run it first and then come back to record.
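A sketch of the final chain and the invocation, assuming the retriever and llm defined above:

```python
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# RAG prompt: force the model to answer only from the retrieved context
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# context comes from the multi-query retriever; the question passes straight through
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Interactive: type the question at runtime...
print(chain.invoke(input("Ask a question about the document: ")))
# ...or pass the question directly:
# print(chain.invoke("What are the five pillars of global cooperation?"))
```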
Essentially, what I asked here was "What is this about?", basically, what is this document I have in my vector database about? And it says: this document is the insights report of the Global Cooperation Barometer 2024 by the World Economic Forum in collaboration with McKinsey & Company; it provides an analysis of the state of global cooperation across five pillars: trade and capital, innovation and technology, climate and natural capital, health and wellness, and peace and security; the report examines trends in cooperative action and their outcomes to determine the overall level of global cooperation in each area, and it also includes recommendations for leaders on how to reimagine global cooperation in a new era. That sounds pretty right, and it sounds like a summary as well, so this is another way of me, I guess, asking for a summary. From what we discussed a little earlier, it took "What is this about?" and probably generated five different variations of it, and I would think one of them was something like "Generate a summary of this," which is why we got this really nice summary.

The next question I asked was, "What are the five pillars of global cooperation?" First, let's go to the document for the answer I expect. The Global Cooperation Barometer's five pillars are: trade and capital, then innovation and technology, then climate and natural capital, health and wellness, and then peace and security. So let's see if it got that right. It answered: the five pillars of global cooperation are 1) trade and capital, 2) innovation and technology, 3) climate and natural capital, 4) health and wellness, and 5) peace and security. I think it did pretty well; it got them all right.

So this is just a simple example to show you how easily you can do this, and the most important thing is that most of these pieces are swappable. You could use LlamaIndex instead of doing it with LangChain, or do it raw without either and build everything yourself, although then you have to manage a lot of things and build a lot of functions for specific tasks, like extracting PDF content, and they might not even be consistent, so there's a lot to consider. There are also the agentic approaches happening now, where you have agents in between steps, maybe before embedding, or within embedding you might even assign an agent to each document, so you have one agent per document that you can query. There are a lot of techniques and strategies for RAG coming up or being improved right now, and this is just a basic example to get started. I might get into some of those, especially the agent ones, because I'm really interested in agents lately; that might be something I do in the future and make a video about, so let me know if you'd like that, by the way.

For this one specifically, I just wanted to show you how to get this done really quickly, and none of it is connected to the internet. You can do this offline, literally: as long as you've imported everything you need up top, you can log off, turn off your Wi-Fi, and run this basically on your system. The next thing I'll do after this, and I'm working on it, is converting this into a Streamlit app that's very easy and user-friendly, because for someone who doesn't code, all this code is probably a little scary, like, what is all this, I don't understand. So I'll convert it to a Streamlit app where you basically have an upload button, you upload a PDF file, it does the embeddings in the background, and then there's a section where you put in your question and just chat with that PDF; a rough sketch of that idea follows.
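A rough, hypothetical sketch of what that Streamlit front end could look like; this is not the finished app from the video, and build_chain() is a made-up placeholder standing in for the notebook pipeline above (load, split, embed, retriever, chain):

```python
import streamlit as st

st.title("Chat with your PDF (local RAG)")

uploaded = st.file_uploader("Upload a PDF", type="pdf")
if uploaded is not None:
    # Save the upload to disk so the PDF loader can read it
    with open("uploaded.pdf", "wb") as f:
        f.write(uploaded.getbuffer())

    # chain = build_chain("uploaded.pdf")  # hypothetical helper wrapping the pipeline
    question = st.text_input("Ask a question about the document")
    if question:
        # st.write(chain.invoke(question))  # real answer once the pipeline is wired in
        st.write("(the model's answer would appear here)")
```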
You should be able to spin that up at least locally on your system; it won't be on the internet either, but it takes all this code that seems very scary and gives you an interface you can interact with. I also want to note that Ollama has a web UI as well that you can use, but the reason I'm trying Streamlit is that you can customize it the way you want; maybe you don't like the way the Ollama web UI is built and you want to build your own thing, and anyway, it's fun to learn and build something new. But that's all I wanted to share today. Please let me know if this is something you enjoyed, and also let me know what you'd like to see me cover in the next videos. With that, happy coding, and see you in the next video. Bye!
Info
Channel: Tony Kipkemboi
Views: 27,653
Keywords: Python, Data, AI
Id: ztBJqzBU5kc
Length: 22min 59sec (1379 seconds)
Published: Mon Apr 08 2024