Build a PDF Document Question Answering LLM System With LangChain, Cassandra, Astra DB, and a Vector Database

Video Statistics and Information

Captions
Hello all, my name is Krish Naik and welcome to my YouTube channel. In this video we're going to build one amazing LLM project: PDF query with LangChain and Cassandra DB. We will create the Cassandra DB on a platform called DataStax, which lets you spin up a Cassandra database in the cloud. What makes this platform so useful is that it supports vector search, and whenever you want to build a Q&A application over large PDF documents, vector search is exactly what you need to implement.

Before we go ahead, let's understand the entire architecture and the steps we will take to complete this project. Say you have a PDF; it can be of any size and any number of pages. First we read the document, and here we use LangChain, because it has functionality for every task needed to build this application. After reading the document, the usual first transformation is to split the content into text chunks, that is, packets of a specific size based on the tokens we are going to use. So in the example here we read the document and then divide it into chunks.

Next we convert all these chunks into text embeddings. For this we will use OpenAI embeddings, which convert text into vectors. Why do we need vectors? You have probably heard about text-embedding techniques in machine learning: bag of words, TF-IDF, Word2Vec, average Word2Vec and more, all of which are already covered on my channel. The main aim of all these techniques is to convert text into vectors, because once text is a vector we can run tasks like classification and similarity search. That is why we will use OpenAI embeddings to turn the text into vectors.

Once every chunk is converted into vectors, the next step matters a lot: if the PDF is huge, the number of vectors keeps growing, so it is better to store all of them in a database. For that we are going to use Cassandra DB. In short, we take all the vectors and save them in a vector database — in this project, Cassandra DB; a small sketch of the underlying idea follows below.
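To make the "text into vectors, then similarity search" idea concrete, here is a minimal sketch — not the project code itself; the two sentences and the API-key placeholder are only illustrative — using LangChain's OpenAIEmbeddings and a cosine-similarity check:

    # Minimal sketch: turn two pieces of text into vectors and compare them.
    # Assumes a valid OpenAI API key; the sentences are arbitrary examples.
    import numpy as np
    from langchain.embeddings import OpenAIEmbeddings

    embeddings = OpenAIEmbeddings(openai_api_key="YOUR_OPENAI_API_KEY")

    v1 = np.array(embeddings.embed_query("The agriculture credit target was increased."))
    v2 = np.array(embeddings.embed_query("Farm lending goals were raised this year."))

    # Cosine similarity: higher means the two texts are semantically closer.
    similarity = float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    print(similarity)

This is the same comparison a vector database performs at query time, just done by hand for two strings.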
Now, what exactly is Cassandra DB? To understand it I have opened the documentation page. Apache Cassandra is an open-source NoSQL database, and it can definitely be used for storing massive amounts of data — as the project itself puts it, it manages massive amounts of data, fast, without losing sleep. Again, understand that this is a NoSQL database, and for storing vectors this kind of database is a natural fit; many large companies already use Cassandra for exactly this purpose. If you want to read more: "Apache Cassandra is an open-source NoSQL distributed database trusted by thousands of companies for scalability and high availability" — that is the most important point, scalability and high availability — "without compromising performance. Linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data."

How are we going to create this database? For that we will use the DataStax platform, which lets you create a vector-enabled Cassandra DB so you can store all these vectors, and at any point of time, when someone queries that DB, you get back the most similar response. So that is the next step: we save all the vectors into a vector database — Cassandra DB, also referred to as Astra DB — created on the DataStax platform, where you can perform vector search. Then, once all your vectors are saved in the database, whenever a human asks anything related to that PDF document, the system applies similarity search over the text embeddings and returns the matching response. That is the entire architecture we are going to build in this project; every step will be shown and explained along with the code.

Now let's start the project: PDF query with LangChain and Cassandra DB. I will go step by step, and I will also show you how to create the Cassandra DB on the DataStax platform. All the comments for this code are given here, and I will also provide the code in the description of this video. So what exactly are we doing? We are going to query a PDF with Astra DB and LangChain, and it is powered by vector search. First you need to understand what vector search is; there is good documentation on the DataStax site. Vector search enhances machine learning models by allowing similarity comparison of embeddings — embeddings simply being text converted into vectors — and on top of that you can apply multiple machine learning algorithms on the fly. As a capability of Astra DB, vector search supports various large language models, and the integration is smooth and
easy. Since these LLMs are stateless, they rely on a vector database like Astra DB to store the embeddings. Stateless basically means that once we lose the embeddings we cannot query them again, so we definitely need a database to store all of them; after that, you can query any number of times.

So let's go step by step. First we need to create a database on Astra DB. I'll click this link — everything is given here — and for this we go to astra.datastax.com. It will first ask you to sign in, and you can sign in with your GitHub or Google account; I'm going to sign in with GitHub. I'll provide the link along with everything in the code itself, so it will be very easy for you. Once you are in the Astra console, the next step is to create a database. The kind of database we are going to create is Serverless (Vector), and this is specifically a Cassandra DB. I'll give my database a name — say I want to do PDF query, so this will be my pdfquery database; you can give any name you want — and a keyspace name, langchain_db, which should be unique. For the provider you have multiple options, such as Amazon Web Services and Microsoft Azure, but I'm going to use Google Cloud, which is selected by default, and in the next step we select the region, which defaults to us-east1. Once you fill in these details — and remember, we specifically want this vector database, because the algorithms we are going to apply work naturally with it — we click Create Database. Once it is created you will see it listed, like my pdfquery database here.

If I go to my dashboard, I have already created several databases like this, so let me use one I created earlier. There are some important pieces of information you need to grab. First, click on Connect. From the Connect page you need two things: a generated token and the DB ID. The DB ID is shown right there, and the token is generated from the same page; I'll explain where this information is used in a moment.

Now let's start coding. Initially we need some important libraries: cassio, datasets, langchain, openai and tiktoken. I'll go ahead and run the install; it may take some time — I have already done the installation, so for me it finishes very quickly.
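The install step, written as the kind of notebook cell used in the video (versions are not pinned here; pin them if you need a reproducible environment):

    # Install the libraries used in this project (notebook-style pip cell).
    # cassio    -> connects LangChain to Astra DB / Cassandra
    # datasets  -> optional, for pulling sample data from Hugging Face
    # langchain -> text splitting, vector store wrappers, LLM interfaces
    # openai    -> OpenAI LLM and embeddings client
    # tiktoken  -> tokenizer used for sizing chunks against OpenAI models
    !pip install -q cassio datasets langchain openai tiktoken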
Since we are going to work with Cassandra DB, LangChain has libraries that help you connect to Cassandra and perform all the necessary tasks — text embeddings, creating vectors, and storing them in the database. So I'm going to import all of these: from langchain.vectorstores.cassandra I import Cassandra; along with that I use VectorStoreIndexWrapper, which wraps the vector store into one package so it can be queried easily; I also import OpenAI, because OpenAI is the LLM we are going to use, and OpenAIEmbeddings, which is responsible for converting the text into vectors. If you want a dataset from Hugging Face you can also use the datasets library. One more important library is cassio: cassio helps LangChain integrate with Astra DB and initializes the DB connection. I'll execute all of this step by step; this is the first stage — installing the libraries and importing everything we are going to use.

We also need one more package, PyPDF2, which lets you read the text inside any PDF — a really handy library. So I run pip install PyPDF2, and you can see it shows "requirement already satisfied" because I have already installed it. From PyPDF2 we use PdfReader, which is the class we'll use to read the document. Here is the document I have taken: a budget speech PDF, the Indian budget of 2023. It's a reasonably big document, around 461 KB and about 30 pages. We are going to read this PDF, convert it into vectors, store them in the database, and then query the database for any information about that PDF.

Now the setup. You need some important pieces of information: the Astra DB application token and the Astra DB ID. Where do you get these two? Go to your vector database in DataStax — the one you are logged into — click on Connect inside your DB, then click Generate Token. You will get a token JSON file containing this information. The first item is the Astra DB application token, which starts with "AstraCS:"; just click Generate Token, copy it, and paste it into the code. The second item is the Astra DB ID, which is simply your database ID; copy it from the same page.
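A sketch of the import-and-setup cell as described (module paths follow the classic LangChain 0.0.x layout used at the time — newer releases move these classes into langchain_community — and the credential values are placeholders, not real secrets):

    # LangChain components for the Cassandra / Astra DB vector store.
    from langchain.vectorstores.cassandra import Cassandra
    from langchain.indexes.vectorstore import VectorStoreIndexWrapper
    from langchain.llms import OpenAI
    from langchain.embeddings import OpenAIEmbeddings

    # Optional: Hugging Face datasets, if you want sample data instead of a PDF.
    from datasets import load_dataset

    # cassio initializes the connection between LangChain and Astra DB.
    import cassio

    # PDF reading.
    from PyPDF2 import PdfReader

    # Credentials: token and database ID come from the Astra "Connect" page,
    # the OpenAI key from your OpenAI account. These values are placeholders.
    ASTRA_DB_APPLICATION_TOKEN = "AstraCS:..."   # generated token, starts with AstraCS:
    ASTRA_DB_ID = "your-database-id"
    OPENAI_API_KEY = "sk-..."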
Once you have these two values, paste them in; I have already pasted mine. You can also see that I'm using an OpenAI API key — don't reuse this one, because I changed it after recording and have already executed the code. So I take these three pieces of information: the first two connect to Astra DB, which hosts the Cassandra DB in the cloud, and the third lets us use the OpenAI API. I execute this, and then we go ahead and read our budget speech PDF — this is the first step in the architecture, reading the document, now that everything is initialized.

After reading, as I said, we divide the content into chunks. But first I collect all the raw text: for every page in the PDF I extract the text — `for i, page in enumerate(pdfreader.pages)`, then `page.extract_text()` pulls out the text of that page — and concatenate everything into the variable raw_text (the notebook also imports Concatenate from typing_extensions at this point). Once I execute this, all the text of the PDF ends up in raw_text; the "\n" characters are just newlines. This step is done — and just imagine, before libraries like this existed it was genuinely difficult to read a PDF, and here we have done it in four or five lines of code.
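A minimal version of the page-reading loop just described ("budget_speech.pdf" stands in for wherever your PDF lives):

    # Read the budget speech PDF and concatenate the text of every page.
    pdfreader = PdfReader("budget_speech.pdf")

    raw_text = ""
    for i, page in enumerate(pdfreader.pages):
        content = page.extract_text()
        if content:              # some pages can come back empty
            raw_text += content

    print(raw_text[:500])        # quick sanity check of the extracted text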
The next step is to initialize the connection to the database. I have my database information — the application token and the database ID — and cassio is the library used to initialize that connection. So I call cassio.init, giving it one parameter called token, which is the Astra DB application token, and the database_id, which is the Astra DB ID. I take those two values and execute it. You may get some warnings — don't worry about them, they are just driver-related warnings — but the cell runs and the DB connection is now initialized.

Next we create the LangChain embedding and LLM objects for later use: I initialize OpenAI with my OpenAI key, and OpenAIEmbeddings with the same key. So now I have my LLM and my embeddings.

Now comes the main step: creating the LangChain vector store. We imported Cassandra earlier, and to this Cassandra store we provide the important pieces of information: which embeddings to use, the table name inside the database ("qa_mini_demo", just a question-answer demo table), and session=None, keyspace=None, which are the defaults. This means that whenever we push any data into my Cassandra table in Astra DB, it will convert all the text into vectors using these embeddings — the ones we initialized just above. I execute this, and that is my Astra vector store. Note that the text has not been converted to vectors yet; only when I push data into the DB will the embeddings convert that data into vectors.

Then we take the entire document, split it into chunks, and embed the text while inserting. First we divide the document into text chunks using CharacterTextSplitter, which lives in langchain.text_splitter; we split the text so it does not exceed the token size. I give CharacterTextSplitter a separator of "\n", a chunk_size of 800 characters, a chunk_overlap of 200, and len as the length function, as in the sketch below.
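Here is a sketch of the initialization, vector store, and splitter cells, using the table name and chunk settings mentioned in the video:

    # Initialize the connection to Astra DB (a driver warning here is harmless).
    cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)

    # LLM and embedding objects for later use.
    llm = OpenAI(openai_api_key=OPENAI_API_KEY)
    embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

    # LangChain vector store backed by a Cassandra table in Astra DB;
    # texts are embedded with `embedding` at insert time.
    astra_vector_store = Cassandra(
        embedding=embedding,
        table_name="qa_mini_demo",
        session=None,    # defaults: cassio.init has already set up the session
        keyspace=None,
    )

    # Split the raw text into overlapping chunks so no chunk exceeds the token limit.
    from langchain.text_splitter import CharacterTextSplitter

    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=800,
        chunk_overlap=200,
        length_function=len,
    )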
Calling text_splitter.split_text(raw_text) gives me all the chunks, and if I look at the first 50 I can see them — all of it straight from the PDF, with each chunk around the 800-character size we configured. Now that the chunks are ready, I'm going to take just the top 50 and store them in the vector database to check that everything is working.

How do we add these texts? Remember what astra_vector_store is: it was initialized through the Cassandra class with my embeddings, so when I insert into the Cassandra table it also applies the embeddings. In other words, when I write astra_vector_store.add_texts(texts[:50]), it performs the embedding and inserts into Astra DB, which hosts Cassandra, in one step. We also wrap the whole thing in a VectorStoreIndexWrapper, which gives us an index over those texts. Once I execute it, it inserts all those chunks into the same database.

Finally, let's test it: the vectors are in my database, so now it's time to query and ask some questions. I have read through this PDF and found some questions to try, like "what is the current GDP" and "how much will the agriculture target be increased". The loop works like this: I set first_question = True and loop with while True; if it's the first question it prompts you to type a question, otherwise it asks you for more questions; typing "quit" breaks the loop, and anything else continues. This is the most important part: as soon as I give my question, it queries the Astra vector index with the query text and the LLM we initialized, and we get back the answer; along with that it also prints the top documents with their similarity scores, as in the sketch below.
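A sketch of the insert-and-query cells just described; the prompt strings are paraphrased, but the flow — first 50 chunks, "quit" to exit, top four scored documents trimmed to 84 characters — is the one walked through here:

    texts = text_splitter.split_text(raw_text)

    # Insert the first 50 chunks; embedding happens automatically on insert.
    astra_vector_store.add_texts(texts[:50])
    print("Inserted %i chunks." % len(texts[:50]))

    # Wrapper exposing a simple .query(question, llm=...) interface.
    astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

    # Simple interactive question-answer loop over the stored vectors.
    first_question = True
    while True:
        if first_question:
            query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
            first_question = False
        else:
            query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

        if query_text.lower() == "quit":
            break

        print('\nQUESTION: "%s"' % query_text)
        answer = astra_vector_index.query(query_text, llm=llm).strip()
        print('ANSWER: "%s"' % answer)

        # Show the four most relevant chunks with their similarity scores,
        # trimmed to the first 84 characters of each.
        print("FIRST DOCUMENTS BY RELEVANCE:")
        for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
            print('    [%0.4f] "%s ..."' % (score, doc.page_content[:84]))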
Let's execute it. As I said, I'm going to use this question: "How much is the agriculture target to be increased and what focus will it be?" I paste it and press Enter, and as soon as I do, you can see it querying the database and returning the top four results. The answer: the agriculture credit target will be increased to 20 lakh crore with a focus on animal husbandry, dairy and fisheries. Why does it show only this much text per document? Because I told it to print only the first 84 characters of each retrieved chunk; if I increase that, it will show more. Along with the answer you can see the top-k documents — the four retrieved chunks — including one about Hyderabad being supported as a centre of excellence and some more context, but the most relevant chunk is the one containing the answer, and if you search the PDF with the same question you will find the same text. Likewise, if I ask what the current GDP is and that information is present, it will return it via the same similarity search: here you can see the current GDP is estimated to be 7%. Isn't that amazing? You can now take any huge document, because at the end of the day you are backed by a database. And if I want to stop, I just type "quit" and the loop exits.

So in short we have performed every step. This is what happens: whenever a human sends a text query, text embedding happens, then similarity search, and then you get the output — the exact pipeline we built step by step. A big shout out to DataStax Astra DB: you can create your own free account, and vector search is super important — any kind of Q&A application can be developed with the help of vector search, which is where DataStax Astra DB comes in, internally using Cassandra DB. I hope you liked this video; all the information will be given in the description. Go ahead and try it out, and I will see you in the next video. Have a great day, thank you, take care, bye-bye.
Info
Channel: Krish Naik
Views: 28,001
Keywords: yt:cc=on, generative AI tutorials, machine learning tutorials, deep learning tutorials, langchain tutorials, pdf query chatbots using generative ai
Id: zxo3T4aQj6Q
Length: 23min 24sec (1404 seconds)
Published: Thu Dec 14 2023