End To End LLM Langchain Project using Pinecone Vector Database #genai

Captions
Hello all, my name is Krish Naik and welcome to my YouTube channel. So guys, here is yet another amazing LLM project for you, and this one is special because it is a base on which you can build almost any text application: a text summarizer, a quiz app, or anything else related to text. The main aim of this project is to give you all the guidance needed to create a production-grade application, the kind you would build in a company, because we are also going to include a vector search database — and this is where you will understand its power. Whenever you work on any NLP project involving text, you convert that text into embeddings, or vectors. If you have a huge number of vectors, you cannot just store them on your local machine; you need some kind of database, and specifically for vectors and embeddings, a vector DB is super beneficial. Why? Because you can apply important algorithms like similarity search, or other text-related tasks such as text classification, very quickly, just by querying the vector database and getting the right kind of output. We are going to cover all of this. It will probably be a longer video, because I will go step by step and write the code in front of you, and wherever any documentation is required I will show you that too. My main aim is to teach you in such a way that you get the right knowledge and can apply it in your company — nowadays many companies are asking interview questions about vector DBs and about
open LLM models, and much more.

So let me share my screen — as I said, we will do this completely from basics. Here is my VS Code, and I have a document here, budget_speech.pdf. I am going to take this document, upload it into my vector DB, and then ask it any kind of query — you could even turn it into a quiz app. Say this were a general-knowledge book: I could convert it into a quiz app with four options and get the right answer out of it. That is what I am planning to do, and beyond that, any idea you have, you can build on top of it.

First things first: create an environment. I make you do this in every project because it is super beneficial — at the end of the day, every project should get its own separate environment. To create one I write `conda create -p venv python=3.10`, where `venv` will be my environment name. Once I execute it, it asks whether I want to proceed with the installation; I just say `y` and go ahead. This is an important step: don't always work in the same environment when you build projects like this.

The second thing is to create my requirements.txt, because whenever I am using an LLM model I have to install a lot of packages. But first I activate the environment with `conda activate venv/`. Done — the environment is activated and we are ready to go. Now in requirements.txt I note down all the libraries I am going to use: unstructured, tiktoken, pinecone-client, pypdf, openai, langchain, pandas, numpy, and python-dotenv. And guys, I would always suggest you really get to know LangChain — it is an amazing library with a lot of power and functionality, the community is huge, and many companies are using it. Requirements.txt is saved, so I quickly install it; it may take some time. While it installs, I also create my .env file and put my OpenAI API key in it — I create the key, save it in .env as OPENAI_API_KEY, and then I can load it; that is what python-dotenv is for. So let's wait until all the libraries are installed.

Those are the initial steps: our environment is ready, all the required libraries are installed, and we have kept our OpenAI key, because I am going to use OpenAI with LangChain. You can also do this with Hugging Face models if you want, but I will try it with OpenAI because the accuracy is much better. Next, I quickly create one file, test.ipynb — this notebook is where I will show you the entire code. Later on you can convert it into an end-to-end project, for example a Streamlit app, but here the main thing is to execute step by step and see everything that is required to create this app.

Again, understand what I am planning to do. Let me clean the screen and draw the whole agenda of the LLM application I want to create. I have a PDF — call it a data source; it could be a GK book, a maths book, anything. First I load, or read, this document. Then I convert it into chunks, because OpenAI and Hugging Face models have restrictions on token size — so I create text chunks. After this I use OpenAI embeddings, which are responsible for converting all the text chunks into vectors. I hope you know what vectors are: a numerical representation of text. These vectors then need to be stored in some vector search DB, and that DB matters because whenever a human being sends a query, we can apply similarity search against it and retrieve whatever information we want. That is what the entire architecture of this project looks like. For the vector search DB I am going to use something called Pinecone — there are lots of vector DBs, and I will talk about advantages and disadvantages; there are other great options like DataStax, which uses Cassandra underneath, but Pinecone is the one we will use, and we will look at its documentation page. Step by step I am going to do this, and you can do it for any number of pages.

One more thing I am going to install is ipykernel, since I will be working in a Jupyter notebook. Let that installation happen, and then I set up my kernel: I select the venv kernel (Python 3.10) and save. Remember, this project will be the base for any kind of chatbot application — MCQ and quiz apps, question-answering chatbots, text summarizers, or a chatbot that answers questions only for a specific domain, with respect to the data we have. All of that is done, so now the first step, as usual: start importing libraries, completely step by step. So, what all do we need?
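The pipeline just described — load, chunk, embed, store, then similarity-search on a query — can be made concrete with a toy, stdlib-only version. The bag-of-words "embedding" and the in-memory list standing in for Pinecone are my own illustrative stand-ins, not the OpenAI/Pinecone code used in the video:

```python
import math
from collections import Counter

def chunk(text, size=40, overlap=10):
    """Fixed-size character chunks with overlap (a crude stand-in for a text splitter)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(text):
    """Toy 'embedding': a word-frequency vector (stand-in for OpenAI embeddings)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# The "vector DB": (vector, chunk) pairs in a list; Pinecone plays this role at scale.
chunks = ["the agriculture credit target is increased to 20 lakh crore",
          "railway capital outlay is increased this year"]
store = [(embed(c), c) for c in chunks]

def similarity_search(query, k=1):
    """Return the k stored chunks whose vectors are most similar to the query vector."""
    ranked = sorted(store, key=lambda p: cosine(embed(query), p[0]), reverse=True)
    return [text for _, text in ranked[:k]]

print(similarity_search("what is the agriculture target"))
```

A real run swaps `embed` for OpenAI's 1536-dimensional embeddings and `store` for a Pinecone index, but the control flow is exactly this.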
I am going to import openai and langchain. Apart from those, I will go ahead with pinecone — I will talk about Pinecone more when I show you the documentation. Then from langchain I am going to use the document loaders: document loaders are responsible for loading any kind of document, including PDF files. For PDFs we use pypdf; I could use PyPDFLoader directly, but since my PDF sits inside a directory I am going to use PyPDFDirectoryLoader.

Next: as soon as we load a PDF we get all the documents, and we have to do text splitting, because we need to convert them into chunks — we cannot pass the entire text at once, since every model has a restricted token size. For example, OpenAI recently came out with GPT-4 Turbo, which has a 128k-token context window. For splitting I import RecursiveCharacterTextSplitter from langchain.text_splitter — you can use other splitters as well; my standing suggestion is to check the documentation of every library I bring in. Then, once I have chunks, I need to convert them into vectors, and for that I use an embedding technique: from langchain.embeddings.openai I import OpenAIEmbeddings, which will convert any chunk into vectors. Next I need the library responsible for creating a vector store on top of Pinecone: from langchain.vectorstores I import Pinecone — this vector store is what we will later connect to the vector DB hosted in Pinecone. Finally I import our LLM, because we will need one: from langchain.llms, OpenAI. I execute all of this — you may get some warnings, but that's okay; this is just the initial load.

Now, you know I have environment variables, so I write `from dotenv import load_dotenv` and call `load_dotenv()` — this loads all your environment variables, such as your OpenAI API key. I also import os, which we will use later.

Now the first step: we have a PDF file and we need to read it. So, let's read the document. I create a function so I can reuse it: `def read_doc(directory)`. Inside it I initialize `file_loader = PyPDFDirectoryLoader(directory)` — I use the directory loader because I pass a directory name; it goes into that directory, finds whichever PDFs are there, and starts loading them. Then I call `file_loader.load()`. I am showing this step by step because everyone should understand what we are doing; later, converting this into modular code will be very easy — that is why I write everything as functions, and all these functions will eventually go into your utils.py file. So `file_loader.load()` loads all the documents, I capture them in `documents`, and the function returns `documents`.

Let's check that everything works: I call `read_doc('documents/')` — the folder name, as a string, is my directory path — and if I inspect the result, it has read all the docs page by page: page content for the first page, second page, third page, and so on; like this, I have 54 pages in my PDF. If you take `len(doc)` you can see it too — I get 58. That means the first step is done: we have loaded and read this PDF.

The next step is dividing these documents into text chunks — that is step two, but up to here everything is working fine. Now, to convert the documents into chunks (again, we really need this because of the model's token-size restriction), I create a function: `def chunk_data(docs, chunk_size=800, chunk_overlap=50)`. First I pass my docs, then the chunk size — I use 800; you could use 1,000, just don't keep a very huge value — and then the chunk overlap, meaning roughly 50 characters can overlap from one chunk into the next. Inside, I create a text splitter using RecursiveCharacterTextSplitter, passing the chunk_size and the chunk_overlap that I
have passed in. Great — so that gives me my text_splitter, and now I use it to split all the documents according to that configuration: I call `text_splitter.split_documents(docs)` and return the result. If you want to know more about RecursiveCharacterTextSplitter, check the documentation I will link — it is good for understanding exactly what it does.

That is my chunk_data function. Now I quickly apply it to my docs: `documents = chunk_data(docs=doc)`. It converts the entire document into chunks of size 800 with an overlap of 50, and if I inspect `documents`, every chunk is now properly laid out. The document we are reading is the Indian budget document, so any question I ask related to the Indian budget, I should be able to get the answer. If I check `len(documents)` I get 58 — just to get an idea.

Next I initialize my embedding technique, OpenAI embeddings: `embeddings = OpenAIEmbeddings()`, with the API key taken from `os.environ['OPENAI_API_KEY']`. This is what converts text into vectors. There are various other embedding techniques — one-hot, word2vec, average word2vec — but OpenAI embeddings give you a much more advanced representation. Every sentence gets mapped to a vector of a fixed size, so let's test it: `embeddings.embed_query('How are you')` returns a vector, and if I check its length, it is 1536. This length will be super important, because when I create my vector database I have to specify exactly this dimension.

Great — now let's create our vector search DB in Pinecone, and this step is very important, because after it we will see what kind of vector database we get. Here is the Pinecone documentation; you can check it out — get started, explore the examples, and so on. The key point about Pinecone is that it helps with semantic search and with chatbots; it stores entire vectors and provides generative QA with OpenAI integration, LangChain integration, retrieval augmentation (RAG), and it has multiple other uses. From the guide: Pinecone makes it easy to provide long-term memory for high-performance AI applications; it is a managed, cloud-native vector database with a simple API and no infrastructure hassles; it serves fresh, filtered query results with low latency at the scale of billions of vectors. So if you have a huge dataset and want to work with vectors, you can store them here as a vector database. Vector embeddings provide long-term memory for AI, and a vector database stores and queries embeddings quickly at scale — anything you save, you can query and get a response very quickly.

First things first, how do you create this? Once you sign up and log in you will see the console. I have already created one index, but the free tier only allows one, so I will delete it and show you how to create one completely from scratch — and note that whatever vectors you store, Pinecone will start indexing them. The old one is terminated; now we create a new index. Give it a name — say langchainvector — and then configure the index: the dimension and metric depend on the model you select. The dimension is exactly the length we got from our embeddings, 1536, so that is what I enter. For the metric I will stay with cosine (you could also use dot product or Euclidean), because at the end of the day the similarity search that happens is based on cosine similarity. Then we create the index. If the old one still shows "terminating", that is just the deletion finishing; the new index, langchainvector, is created on the free tier.

From here there are some important pieces of information we need to retrieve: the environment and the index name. Back in the code, under "Vector search DB in Pinecone", I initialize with `pinecone.init(...)`, which requires two things: an API key and an environment. Along with those I also set `index_name = 'langchainvector'`, which I copied from the console. Let's go and see where the API key and the environment are — back to the console.
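As an aside on the metric chosen for the index: cosine similarity compares the direction of two vectors, not their magnitude, which is why it works well for comparing text embeddings. A minimal stdlib illustration, using 3-dimensional vectors as stand-ins for the 1536-dimensional embeddings:

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|): 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Tiny 3-d stand-ins for the 1536-d OpenAI embedding vectors.
v_query = [1.0, 2.0, 0.0]
v_close = [2.0, 4.0, 0.0]   # same direction, different magnitude -> similarity ~1.0
v_far   = [0.0, 0.0, 3.0]   # orthogonal -> similarity 0.0

print(cosine_similarity(v_query, v_close))
print(cosine_similarity(v_query, v_far))
```

Pinecone computes exactly this score (at scale, over the stored index) when the metric is set to cosine.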
There is an option called API Keys: this is my API key, so I copy it and paste it into the code. Now, where do I get the environment? If I go to Indexes and click the index, that is where the environment is shown, so I paste that too. With these two pieces of information, executing the cell initializes my connection to the vector search DB.

But at the end of the day I need to push all the embeddings into the vector DB. For that I again use the Pinecone vector store I imported, with `Pinecone.from_documents`: first I pass my doc parameter — the chunks I need to store — then `embeddings` (the same OpenAI embeddings we created), and then `index_name=index_name`, whatever index name I initialized. As soon as I execute this, it takes some time because there is a fair amount of data, but you can see the changes in the console. If I go there and refresh: query-by-vector and all the data is there; the vector count is 58, matching the document count we saw, and everything has been stored with its metadata and indexed. Now any query you send in the form of vectors will get you the corresponding responses.

For the query part, I apply cosine similarity to retrieve results. I write `def retrieve_query(query, k=2)` — k=2 means I will take the top two matches. To get the matching results I use the index we created above: `matching_results = index.similarity_search(query, k=k)` — similarity_search is a function available on the index object itself, and I pass the query along with the k value. Then I return `matching_results`. So that is my retrieve_query function: for any query about that PDF, it fetches the matching documents from the vector DB. I label this cell "cosine similarity retrieve results from vector DB" so you can follow along.

Now, two important imports: from langchain.chains.question_answering I import load_qa_chain, and I use the OpenAI class to create my LLM. I instantiate the model as OpenAI with `model_name='text-davinci-003'` and a small temperature value, and then create my chain with load_qa_chain — load_qa_chain helps you build a question-answering application — passing `chain_type='stuff'`. So my chain is ready, my LLM is ready; all that remains is answering queries, and for that I will use the retrieve_query function. I write "search answers from vector DB" and define `def retrieve_answers(query)`: inside it, `doc_search = retrieve_query(query)` — calling the same function with whatever query I pass in — and I print doc_search. Then I run the chain I created: `response = chain.run(input_documents=doc_search, question=query)`. The chain runs and gives back whatever response it finds from the vector DB, if anything matches, and I return that response.

Done — now see the magic once I call this function. I read the PDF and found one question in it: "How much will the agriculture target be increased by, by how many crores?" I write that as my query and call retrieve_answer — oh, the spelling should be right: retrieve_answers; let me fix the function name in both places. As soon as I pass my query, it goes to retrieve_query, does the similarity search against the index, and returns the top two results. Let's see the answer — "retrieve_answers is not defined"? Ah, I have to execute the cell first, sorry. Now it gets the answer: the agriculture credit target will be increased to 20 lakh crore, with an investment of such-and-such — exactly the information I asked for. I can also write any other question, like "How is agriculture doing?" — I may get some kind of answer, and if it cannot find anything it will say "I don't know". Here it answers: the government is promoting cooperative-based such-and-such, and so on.
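Pulling the notebook pieces together, the code dictated above looks roughly like this. A sketch, not a definitive implementation: the module paths match the late-2023 LangChain 0.0.x and pinecone-client 2.x releases used on screen (both libraries have since reorganized their APIs), you need your own OpenAI and Pinecone credentials for it to run, the `temperature=0.5` value is my assumption since the exact number is not audible in the captions, and the API key and environment strings are deliberately left as placeholders:

```
import os
from dotenv import load_dotenv
import pinecone
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

load_dotenv()  # picks up OPENAI_API_KEY from .env

def read_doc(directory):
    return PyPDFDirectoryLoader(directory).load()

def chunk_data(docs, chunk_size=800, chunk_overlap=50):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    return splitter.split_documents(docs)

doc = chunk_data(read_doc("documents/"))
embeddings = OpenAIEmbeddings()  # reads OPENAI_API_KEY from the environment

pinecone.init(api_key="...", environment="...")  # both from the Pinecone console
index = Pinecone.from_documents(doc, embeddings, index_name="langchainvector")

llm = OpenAI(model_name="text-davinci-003", temperature=0.5)  # temperature assumed
chain = load_qa_chain(llm, chain_type="stuff")

def retrieve_answers(query, k=2):
    doc_search = index.similarity_search(query, k=k)
    return chain.run(input_documents=doc_search, question=query)

print(retrieve_answers("How much will the agriculture target be increased by?"))
```

The `stuff` chain type simply stuffs the retrieved chunks into the prompt alongside the question, which is why the chunk size of 800 was kept well under the model's context limit.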
This information is coming from the entire PDF itself — see how beautifully you can retrieve it; it is now acting as a question-answering application. And this is the base: you have a vector DB, you ask any question, and you get a response. On top of it you can do a lot of prompt templating — convert it into a quiz app, a question-answering chatbot, a text summarizer, whatever you want. That is the real power here, and that is what I will show you in the next video: how to do custom prompt templating on top of this LLM application. In this one I have shown you what a vector database is, why it is so important, and how the vectors are actually stored — if I search for any vector in the console, I get the corresponding responses. So start using this; many companies are. For your practice the free tier is completely free, but if you want more than one or two indexes, at that point you can take the paid version, which is what a company setup typically requires.

I hope you were able to understand this amazing project — I have shown you everything step by step. I will provide the code, but in the next video, because it would take me another thirty minutes to finish here, I will try to create a quiz app on top of it. If you are getting this kind of response, how can you apply a prompt template? Think it over and try it yourself; I will definitely wait for your answers, and in a couple of days I will record one more video and show it to you. So yes, that was it from my side. I hope you liked this video — if you did, please subscribe to the channel and press the bell notification icon. I will see you all in the next video. Thank you, take care, bye-bye.
Info
Channel: Krish Naik
Views: 52,559
Keywords: yt:cc=on, end to end llm project, vector database, pinecone vector database, langchain tutorials, openai tutorials, vectordb, openai embeddings, krish naik llm projects
Id: erUfLIi9OFM
Length: 36min 0sec (2160 seconds)
Published: Tue Dec 05 2023