LLM Project | End to End LLM Project Using Langchain, OpenAI in Finance Domain

Captions
Today we will build an end-to-end LLM project that covers a real-life industry use case: equity research analysis. We will build a news research tool where you can give it a bunch of news article URLs, and when you ask a question it retrieves the answer based on those articles. In terms of technology we have used LangChain, OpenAI, and Streamlit. To make the project more interesting we have added some fun storytelling as well, so let's look at that story first.

What if Rocky Bhai lived in the ChatGPT era, how would he invest all his money? Would he use ChatGPT to find the best investments? No way, he would hire someone for that. Rocky Bhai's recruitment team got Peter Pandey, the equity research analyst. Peter read lengthy stock market articles for his research, but Rocky Bhai did not like it. Peter promised to create a chatbot, like ChatGPT, for his investments. Rocky Bhai liked Peter's grit and said "fasten your seat belt." So get ready folks, we are going to create a chatbot for Rocky Bhai.

Equity research analysts such as Peter Pandey in our Rocky Bhai story do exist in real life. Let me give the example of a mutual fund. You might know about mutual funds where you can invest your money: the three people shown in yellow are common people like us who invest in the mutual fund, and the mutual fund eventually invests in individual stocks. Now it needs to pick the right stocks, for which it might have a team of research analysts whose job is to produce research on companies, say Tata Motors or Reliance: how these companies are doing, what their profits will be next year, how the management is, whether this is a good stock to buy. In that research team every individual might cover a couple of stocks. Say Peter Pandey is working for HDFC mutual fund; he might be looking at Tata Motors and Reliance, and his job is to research those stocks. So he comes to work daily, reads a bunch of articles from Moneycontrol or Economic Times, or maybe he has access to premium products such as the Bloomberg terminal, and he does all his research based on news articles, earnings reports, quarterly financial reports, and so on.

Now you can understand that reading news articles from all these websites is a tedious task: there are so many articles, so much information to consume. Here I'm showing a P&L of Tata Motors. Why don't we build a tool which looks like this: you can put a bunch of news article URLs on the left-hand side (I am showing just three, but you can have any number), and then when you ask a question it retrieves the answer. Here I'm showing different articles from moneycontrol.com, and when you post this question it retrieves the answer, 6.55 to 8.1 lakh, and it pulled that from this particular article; the article link is shown below. You can also ask for a summary; the answer doesn't have to be a single number, it can also summarize the entire article.

I know about all this because I worked with Bloomberg for 12 years. On the Bloomberg terminal we used to get research reports from Jefferies, Oppenheimer, and other such companies, and we would process that data and show it on the terminal. So I hope you have some understanding of the industry use case. This is a real industry use case, and this tool can be used by companies such as Jefferies and Oppenheimer.
All right folks, so this is not some toy project. Let's think about the technical architecture now. We need to go back to basics in order to build it. Whenever you talk about building any LLM app, the first thing that comes to mind is: can I use ChatGPT for this, because ChatGPT is free? Well, actually you can. You type a question into ChatGPT and say "answer this question based on the below article, do not make things up," and then you copy-paste the article from the news website. ChatGPT has that capability: see, EPS is 8.35, it can pull the answer from the given text. So the question is, why do we need to build this tool, why can't I just use ChatGPT?

Apparently there are three issues with this approach. Number one, copy-pasting articles is tedious; equity research analysts are busy folks, they don't have time to go to a website, copy-paste, and then get the answer. Number two, they also need an aggregate knowledge base, because when they ask a question they don't know where the answer might be. They might ask, say, how many Tata Nanos Tata Motors sold last quarter; the answer might be in any article, so how do they know which article to pull? And some answers might be spread over three or four different articles. So they need some kind of aggregate knowledge base, and ChatGPT can't give that. The third issue is ChatGPT's word limit: you can't copy-paste a huge article, there is a limit on the number of words you can supply.

So we need to build a tool which can go to the news websites our equity research analyst trusts and put all those articles into some kind of knowledge base, the database I'm showing here, and then a ChatGPT-like chatbot which can pull data from that knowledge base.

Now let's think about this particular article on Nvidia. Say I have this question: what was Nvidia's operating margin compared to other companies in the semiconductor industry, and give me the answer based on the following article. When I give the entire article, it gives me the answer in a perfectly fine, expected manner. But think about it: we are building a tool here, we are not using ChatGPT directly, so behind the scenes we will be calling the OpenAI API, and there is a cost associated with every call, priced per thousand tokens (in simple layman language you can roughly think of a token as a word). The more text you supply to OpenAI, the more it costs. But read the question carefully, folks, this is very interesting: the answer is actually in the first paragraph. We don't have to supply the second paragraph; the answer is 17.37 percent, so the second paragraph is not necessary. So is there a way we can smartly figure out that for this question we only need to supply this much of a chunk? If you do that, you will save a lot of money on your OpenAI bill. Just think about this article as two different paragraphs, and based on the question you figure out which paragraph to supply in your prompt.

Thinking about this in a generic way, you might have a bunch of articles, say on Nvidia, and when you ask a question like "what's the price of the H100 GPU," you want to figure out the relevant chunks.
Let's say the H100 GPU price is mentioned in chunk 4 and chunk 2. In that case, when you build the prompt you don't need to give all the chunks 1 to n; you can give just chunk 2 and chunk 4. A chunk is simply a block of text that is relevant, where the answer might be present for a given question. When you do that, the LLM gives you the final answer.

The question now becomes: how do I find the relevant chunks? You can't use a direct keyword search. I can't just say "H100 GPU," do a Ctrl+F-style lookup across all the chunks, and return every chunk where "H100 GPU" appears. H100 GPU is probably a simple example, but look at this one: when I go to Google and type "calories in apple versus revenue of Apple," it knows that the first one is a fruit and the second one is a company. How does it know that? It uses the concept of semantic search: it looks at the context. We as humans, when we see "calories," figure out it's the fruit, and "revenue of Apple" is the company. Similarly, in any NLP application, semantic search can figure out from the context what a word like "apple" means, whether it is a fruit or a company. For this we use word embeddings or sentence embeddings, and a vector database. I have made a separate video on these concepts, so I don't want to spend time explaining them all here; if you already know this, fine, move ahead, otherwise the link to that video is in the description below, so you can pause this video and watch that one first. For simplicity: embeddings allow you to figure out the relevant chunks, and a vector database allows you to perform a faster search over them, and then you put those chunks in your prompt to OpenAI and get the answer.

So now, thinking about the technical architecture: the first component will be some kind of document loader, where you take all your news articles and load them into an object. The second will be splitting that text into multiple chunks and storing them in a vector database, so that for a given question, say "what is the price of the H100 GPU," you can go to the vector database and retrieve the relevant chunks, which here are chunk 2 and chunk 4. The vector database allows you to perform that faster search because it might hold millions and millions of records; the way vector databases are designed, they help you search quickly. Once you have the relevant chunks, you put them in your prompt and get the answer. In terms of LangChain, we will be using all the classes I have shown in orange to build our application.

For the short term, phase one, we are building this tool in Streamlit. When you do this project in industry, say as a data scientist for Jefferies, you're not going to build the whole project in one go. You will first build a POC, a proof of concept, and for that you build this kind of tool in Streamlit: a few article URLs on the left, a question box on the right that gives the answer. That way you gain confidence that the approach works. Once we are happy with the results, the long-term architecture may look something like this. You need two different systems: the first one is a data ingestion system.
In the data ingestion system you go to your trustworthy news article websites, write a web scraper (implemented either in native Python or with a tool like Bright Data), and run it on some kind of cron job schedule, say every one or two hours. It pulls the data and converts that text into embedding vectors using OpenAI, LLaMA, BERT, or whatever embedding you want to use. Those vectors go into a vector database; for that we can use Pinecone, Milvus, or Chroma, which are the popular solutions today. That is your data ingestion system. The second component is the chatbot, where in React or some other UI framework you build a chatbot similar to ChatGPT. A person types in a question, the question is converted into an embedding (again with OpenAI, LLaMA, or whatever embedding you use), and from the vector database you pull the relevant chunks; the green and orange blocks are the relevant chunks that match the question, for example "what was the Q3 2020 EPS." Based on those chunks you form your prompt, give it to an LLM, and put the answer back into the chatbot UI. So this is the overall architecture you'll be working with. Remember that when you work in industry as an NLP engineer or data scientist, you first brainstorm with your team, come up with this kind of technical architecture, and only then start coding; you don't want to go in a wrong direction.

All right, in the next section we'll talk about text loaders. Before that, make sure you have watched the LangChain crash course so you have a basic overview of the LangChain library. Assuming you have watched it, the next step is to install LangChain by running pip install langchain. Once that is installed, launch a Jupyter notebook and import the class: from langchain.document_loaders import TextLoader. I have imported a simple text loader; there are multiple types of loaders that LangChain offers, and we will look at them one by one.

TextLoader lets you load data from a text file. Here I have some Nvidia news in a text file called nvda_news_1.txt, so I will load it and show you how this works. I pass "nvda_news_1.txt", call the result loader, then do loader.load(), and it returns a data object. If you print the data object it has all the news content inside, and it is actually an array: the 0th element is a Document which has page_content as one of its fields. Let me show you in a separate cell: page_content shows the entire text content. The other field this class has is metadata, and here the metadata is the name of your text file.
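Here is a minimal sketch of that TextLoader flow, assuming LangChain is installed and nvda_news_1.txt sits in the working directory (the import path follows the classic langchain package used in the video; newer releases move the loaders to langchain_community.document_loaders):

```python
# Minimal TextLoader sketch for the step described above.
from langchain.document_loaders import TextLoader

loader = TextLoader("nvda_news_1.txt")
data = loader.load()                    # returns a list of Document objects

print(len(data))                        # 1 document for a single text file
print(data[0].page_content[:200])       # the raw news text
print(data[0].metadata)                 # {'source': 'nvda_news_1.txt'}
```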
Now if you Google "langchain text loader", or go to the LangChain documentation, you will find documentation for the various loader classes. It is sometimes hard to navigate, and it can change depending on when you look at it, but the document loaders Python guide lists all of these loaders. The first one is the text loader we just used; the second one is a CSV loader, so let me talk about CSVLoader real quick.

I'll copy-paste the import, so now I have the CSVLoader class here. For CSVLoader I obviously need to pass a CSV file, and luckily I have this movies.csv file which has around nine records; you can see nine records with movie title, industry, revenue, and so on. I will provide all these files in the video description below, so make sure you check it; the code and all the files will be provided to you. Here I pass "movies.csv", that gives me a loader, and loader.load() gives me data. If you check len(data) you will find nine records, because the CSV file had nine records, and if you look at the very first record you again get that Document class (check type() and you will see the Document class from the LangChain library). That class has two fields: page_content, which here is the entire record from your CSV file, movie ID, title, and so on, separated by \n; and metadata, which has movies.csv as the source.

Now one may argue that for this metadata I want the movie name or the movie ID instead. Metadata is something we will use in our project: remember in the preview, when you typed in a question, it not only gave you the answer but also a source link. How does it reference back to the source? Through this metadata, and we'll look into that later. For now, I will set the source_column: what columns do I have? Movie ID, title... let me keep title as the source column. When I do that, you will see that the metadata contains "KGF2" for the first record, "Doctor Strange" for the second, and so on; you can view all the records, here is one record, here is its metadata, and so on.
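A hedged sketch of that CSVLoader usage, assuming a movies.csv with a title column (the column names and the example metadata value follow the video and are illustrative):

```python
# CSVLoader with a custom source column, as described above.
from langchain.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path="movies.csv", source_column="title")
data = loader.load()

print(len(data))              # one Document per CSV row (9 rows in the video's file)
print(data[0].page_content)   # the whole row, one "column: value" pair per line
print(data[0].metadata)       # e.g. {'source': 'KGF2', 'row': 0} -- source taken from the title column
```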
Now let's talk about the UnstructuredURLLoader, because that is what we'll be using in our project. In the project we will take some news article, say this particular article on HDFC Bank, and load the text content from it directly into the Jupyter notebook using a ready-made LangChain class, and that class is UnstructuredURLLoader. This is how you import it, and by the way you need to install a couple of libraries before that; you install them by running this particular pip command in the notebook, or by copy-pasting it into your Git Bash or Windows command shell. It's a usual thing, folks, you should know how to install libraries. UnstructuredURLLoader uses a library called unstructured underneath; if you search for the Python unstructured library, that is what it uses to go to the website, look at the DOM object, the HTML structure, and pull out the information. So let's create that class; the argument is the list of URLs you want to supply, and the two URLs I supply here are two different articles. That becomes my loader, and the usual step is loader.load(), which returns data. When you do len(data) it takes some time, but it returns 2, because there are two articles. Look at the first article: again the same structure, page_content, and if you check the metadata, this time the metadata is the source URL link. We will be using this in our news research tool.
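A small sketch of that loader, assuming the unstructured package (and whatever extras the video's pip command installs) is available; the two URLs below are hypothetical placeholders for any articles you trust:

```python
# UnstructuredURLLoader sketch: one Document per URL, with the URL kept in metadata.
from langchain.document_loaders import UnstructuredURLLoader

urls = [
    "https://www.moneycontrol.com/news/article-1.html",   # hypothetical URL
    "https://www.moneycontrol.com/news/article-2.html",   # hypothetical URL
]
loader = UnstructuredURLLoader(urls=urls)
data = loader.load()

print(len(data))            # 2 -- one Document per URL
print(data[0].metadata)     # {'source': '<the first URL>'}
```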
After we load our documents through loader classes in LangChain, the next step is text splitting, and for that we have classes like CharacterTextSplitter and RecursiveCharacterTextSplitter. The reason we do this is that any LLM has a token size limit, so we need to reduce a big block of text into smaller chunks that fit within that limit. Also, the individual chunks we get after splitting might be much smaller than the token limit, which is, say, 4097: if the first chunk is 3000 tokens and the second is 1000, it makes sense to merge the two so the result is closer to the limit and works more efficiently. So there is an extra step: first you take the huge block of text, divide it into smaller chunks, and then merge, so that each final chunk (the blue, green, and orange blocks on the slide) is close to the limit, which could be 4097 or 2000 depending on the LLM you're using. We also want some overlap, so that when you are reading the orange paragraph you still have some context from the blue paragraph just before it; part of the blue paragraph goes into the orange chunk, and part of the orange paragraph goes into the green chunk. That shared portion at the top is the chunk overlap. All of this can be done with some simple APIs in LangChain, so let's look at it.

Here I have taken the Wikipedia article on the movie Interstellar (you might have seen that science fiction movie) and we are going to split it. Say I have a limit of 200 tokens in my LLM; how do you split this text so that each chunk is of size 200? The obvious thing that comes to mind is the simple slice operator in Python, but when you do that, you will notice it can cut words in the middle: see "Mat"... what is "Mat"? That's Matt Damon being cut in half, which doesn't look great; you at least want complete words, so the simple slice operator is not going to work. Then you'll say, what's the big deal, I can write a for loop like this where each of the chunks is less than 200.

OK, you can do that, but writing these kinds of for loops is a little tedious and can have other issues as well. LangChain provides a very simple API so you don't have to do all this work manually, and that API comes through the various text splitter classes. Let's try the first, simple one: from langchain.text_splitter you can import, say, CharacterTextSplitter. This class takes a separator as an argument, the character on which you want to separate things; let's separate on the newline character, \n, which means each line, or a merge of several lines (because it does the merge step too), can be one chunk. My chunk size, say, is 200 (in reality it would be around four thousand, but for simplicity we use 200), and I keep chunk overlap at zero to keep things simple. That is my splitter, and I use it to split the text with split_text, which returns the chunks. Check the number of chunks: 9, and you can see the first chunk, the second, the third, and so on. If you print the individual chunk lengths (for chunk in chunks: print(len(chunk))), you will notice that while most of the chunks are under 200, some are over 200. Why? Look at the last two chunks, they are big: in that entire chunk there is no \n, it's one long run of sentences without a newline, so obviously it can exceed the size. Maybe you change the separator from \n to a period so it splits after every sentence, but what if the text is full of questions and has no period? You could use a space, but no matter what you use, you will always face one issue or another. For some cases CharacterTextSplitter will work, but we need something a little more advanced that can split on multiple separators, with rules like: first divide on two newlines (\n\n), then one \n, then a period, then a space. That is exactly what the recursive version does, and it is called RecursiveCharacterTextSplitter.

In RecursiveCharacterTextSplitter the arguments are roughly the same, except that you provide a list of separators instead of just one. So my first separator is \n\n, the second is \n, the third could be a period or a space; chunk size and chunk overlap I keep the same, and I call this r_splitter. We split the text, store the result in chunks, and check the count: 13 in total. I also print the sizes of the individual chunks, and you can see that the majority of them, actually all of them, are now under 200.
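A condensed sketch of the two splitters just described, using the same illustrative settings (chunk_size=200, chunk_overlap=0); the text variable is a placeholder standing in for the Interstellar article:

```python
# Comparing the simple and the recursive splitter on the same text.
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

text = "..."  # placeholder: the long Wikipedia text used in the video

# Single separator: some chunks can still exceed chunk_size when a run of text has no "\n"
splitter = CharacterTextSplitter(separator="\n", chunk_size=200, chunk_overlap=0)
chunks = splitter.split_text(text)
print([len(c) for c in chunks])

# Multiple separators tried in order, so (almost) every chunk ends up under the limit
r_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ".", " "],
    chunk_size=200,
    chunk_overlap=0,
)
chunks = r_splitter.split_text(text)
print([len(c) for c in chunks])
```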
So let's understand how this works under the hood. You make just one API call, but internally this is what happens. First it tries the first separator, \n\n, and splits on it; our big text got split into three chunks, one, two, three. If I print their sizes, all three are more than 200, because so far we have only split on \n\n. Now take the first chunk, which I'll call first_split just to keep things simple; its length is 439. When LangChain detects that this is more than the specified size of 200, it splits it further using the second separator, \n, and gets three more pieces. Look at their sizes; let me call this second_split: the first piece is under 200, fine; the second is 121, fine; but the third is 210, definitely more than 200. So it then goes to the third separator, the space, and splits on that. Once it has split things apart, it merges them back, remember the merge step from the slide, because the individual pieces are now too small: when you split on spaces each piece is tiny, three characters, two characters, one character, and we can't have chunks that small, so it merges them back in an optimized way. Concretely: of the pieces in second_split, the first two are kept as they are, and the third one, the 210-character piece, gets divided; since our size is 200, it produces one chunk of roughly 199 characters and another of roughly 10 (plus or minus a bit, because words have to stay intact, it can't break a word apart). That is why, when we use this API, you see 199 and 10: the 105 and 120 pieces stay roughly the same (106 and 121), and the 210-character piece is split into 199 and 10, which together with the dropped space character adds back up to 210.
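To make the under-the-hood behaviour concrete, here is a tiny illustrative re-implementation of the idea; this is not LangChain's actual code, just a sketch of "split on the first separator, recurse into oversized pieces, then merge small neighbours back up to the chunk size":

```python
def recursive_split(text, separators=("\n\n", "\n", " "), chunk_size=200):
    """Illustrative sketch of the recursive split-then-merge idea (not LangChain's code)."""
    sep, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if len(part) > chunk_size and rest:
            # still too big: recurse with the next separator in the list
            pieces.extend(recursive_split(part, rest, chunk_size))
        else:
            pieces.append(part)

    # merge step: greedily pack neighbouring pieces back together up to chunk_size
    chunks, current = [], ""
    for part in pieces:
        if len(current) + len(part) + 1 <= chunk_size:
            current = (current + " " + part).strip()
        else:
            if current:
                chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks
```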
So that is how the splitting works. I hope you now have some understanding of the recursive text splitter; this is what we will use in our news research tool. Now that you understand text splitters, let's look at the next step, which is vector databases.

For vector databases there are a lot of options, Pinecone, Milvus, Chroma, but we are not going to use them in our project. We will use something called FAISS, which is like a lightweight, in-memory vector-database-style tool. FAISS stands for Facebook AI Similarity Search; it is actually a library that lets you search a set of vectors quickly, but it can also serve as a vector database if your project is small and your requirements are lightweight. You can read about FAISS in depth, but I will give you a quick understanding. Once you have the chunks created with the recursive text splitter, for our project we will convert them into embeddings (embedding conversion is a must step; we can use OpenAI embeddings, Hugging Face embeddings, word2vec, there are so many embeddings out there, and based on the problem statement we can use any of them) and then store them. If we were using Pinecone or Milvus we would store these in a proper vector database, but for our project we will store them in a FAISS index, an in-memory structure that can do a fast search over your vectors. So if you have an input question like "what is the price of the H100 GPU," we first convert it into a vector using the same embedding technique and give it to the FAISS index; say the chunks produced 1 million vectors, FAISS will efficiently search and tell you which of those 1 million vectors are most similar to the given one. I have explained this in detail in a separate video which I'll link in the description, so please watch it if you haven't.

But let me quickly show how FAISS works with some simple code. You first install these two libraries, and once they are installed I import pandas and increase the pandas DataFrame column width (I'll explain why later). I load a CSV file which has eight records, different pieces of text and their category, things like health and fashion; I load it into a DataFrame, and the DataFrame shape and contents look like this. Now I will convert these eight sentences into vectors, and I'll do it with the sentence-transformers library: from sentence_transformers import SentenceTransformer. For the SentenceTransformer I am going to use a model called all-mpnet-base-v2; if you want to read more about it, search for Hugging Face sentence transformers and you can figure out how it works, but in simple language, all it does is convert text into a vector. So I create the encoder, and then encoder.encode() expects an array of texts; the array of texts is df.text.
When you pass that, it encodes the entire column, and you store the result in vectors. Let me print the vector shape; it might take some time, by the way, if you're running it for the first time, so have some patience. You can see there are eight vectors in total, and it's a two-dimensional array: the first row is one vector, the second row is another, and so on. So "meditation and yoga can improve mental health" has this vector corresponding to it, and you can see from the dots that the total size of one vector is 768, which I'm going to store in a variable: vectors.shape[1] is the size of each vector, and we have eight such vectors. I store 768 in a parameter called dim (dimension) because I'll use it later.

Then I import the faiss library, and once it is imported I call IndexFlatL2; this is the index that uses Euclidean distance, the L2 distance. Again, if you want more detail you can go to faiss.ai or their GitHub page and read further, but it is very simple: similar to a database index, it creates an index that lets you search quickly later on. Here I supply the dimension, and that is my index; I'm creating an index of size 768. When you print the index you see nothing, it's just an empty index. Now into that empty index I can add some vectors, so I add them, and my index is ready. Going back to that picture: we have eight vectors in total, each of size 768, and we are just adding them to the FAISS index. FAISS will internally construct some kind of data structure (what exactly it is, is out of scope for this video) that allows fast similarity search, so that for a given vector we can find which two or three of these eight vectors are most similar.

Now that I have the index, I can do index.search(), and here I need to supply a search vector. We don't have one ready, so I'll take an input search query, say "I want to buy a polo t-shirt," and of course I have to encoder.encode() that query; as the picture shows, we need to convert it into a vector, so that is what I'm doing here, and I get the vector back. If you look at the vector shape, it's a simple one-dimensional array of 768, but index.search() expects a two-dimensional array, so I use NumPy to convert it. It's simple, folks: it's like putting the vector inside an outer array, so the one-dimensional array of 768 becomes two-dimensional; if you print it, the vector is the same but there is one extra array wrapped around it, and the reason is simply that this function expects that format. There is one more argument missing: how many similar vectors do you want? This is like k-nearest neighbours. Say I want the 2 most similar vectors, and it gives me these two. It returns a tuple: the first element is the distances, the second element is the row indices in our original DataFrame.
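The whole FAISS demo in one place, as a hedged sketch; the CSV file name is a placeholder for the small 8-row file used in the video, and it assumes faiss-cpu and sentence-transformers are installed:

```python
# Semantic search over a handful of sentences with sentence-transformers + FAISS.
import faiss
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

df = pd.read_csv("sample_text.csv")                  # assumed columns: text, category

encoder = SentenceTransformer("all-mpnet-base-v2")   # 768-dimensional embeddings
vectors = encoder.encode(df.text.tolist())
dim = vectors.shape[1]                               # 768

index = faiss.IndexFlatL2(dim)                       # exact search with L2 (Euclidean) distance
index.add(vectors)

query = "I want to buy a polo t-shirt"
search_vec = np.array(encoder.encode(query)).reshape(1, -1)   # search() wants a 2-D array

distances, ids = index.search(search_vec, 2)         # 2 nearest neighbours
print(df.loc[ids[0]])                                # the two most similar rows
```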
So locate the rows with index 3 and 2 in our original DataFrame. Which ones are 3 and 2? Both are articles related to fashion, and "I want to buy a polo t-shirt" is indeed about fashion. I store the results as distances and indices, and to look up rows 3 and 2 you can use df.loc: df.loc[[3, 2]] gives you those articles, and if you want to do it programmatically, I[0] is [3, 2], so you can pass that in the same way.

One thing you might have noticed is that the exact words of "I want to buy a polo t-shirt" are not present in the matched rows. So this is not a keyword search, this is a semantic search: it captures the context, the meaning of the sentence, and returns similar sentences. Our DataFrame has rows about meditation, yoga, and so on, but it returned only the fashion-related articles. You can change the sentence, so let me try "an apple a day keeps the doctor away" and run all the cells below (OK, there was a small problem, let me just run all the cells), and you will notice the search results are related to health. Once again, the exact words don't match any of those sentences, but if you as a human had to pick the two sentences most similar to "an apple a day keeps the doctor away" out of these eight, you would probably pick these two, because we are talking about health, and it gave me the health-related articles. You can try something else, say "looking for a place to visit for the holidays," run all, and it gives you two articles related to travel. I'm going to provide all these notebooks in the video description below, so check them out, play with them, and you will get the idea. This was a quick demo of the FAISS library, which we'll use in our news research tool project.

Let us now discuss the RetrievalQAWithSourcesChain. Once you have stored all your vectors in a vector database, the next component is asking a question and retrieving the relevant chunks. Say my relevant chunks are chunk number 2 and chunk number 4; using these chunks I form an LLM prompt, something like "I have an H100 GPU, what is the price of it? Give me the answer based on the text below," followed by chunk 2 and chunk 4, and the LLM gives me the answer. The benefit is that you tackle the token limit problem and also save some money on your OpenAI API calls.

Now think about how the chunks are combined. What we did here is put all the retrieved chunks into one single prompt. Here I got chunks 2 and 4, but I might actually get more, say four chunks, and the combined size of those chunks could exceed the LLM token limit, and that is the drawback of this method. This method, by the way, is called the stuff method: you take all the similar-looking chunks from your vector database and form a single prompt.
When you give all those chunks together, the prompt may cross the LLM token limit, and that is the drawback of this method. If you know the chunks will not cross the token limit, fine, the stuff method still works, and it is the simplest of all. But a better method, especially when the combined chunk size is bigger, is map_reduce. In the map_reduce method we make an individual LLM call per chunk: say I have these four similar chunks, then for my question "what is the H100 GPU price" I ask "give me the answer based on chunk 1," then again with chunk 2, then chunk 3, then chunk 4. So you ask four different questions, each time passing a different context, and you get four answers (there is a typo on the slide, by the way, these should be FC1, FC2, FC3, FC4, a filtered chunk or individual answer each). Then you make a fifth call where you combine those answers and tell the LLM: out of these four answers, give me the best one, or combine them into the final answer. This way you tackle the token size limit, but the drawback is that you are making five LLM calls, one, two, three, four, five, where the previous method made just one. There is always a trade-off.

So now let's do some coding and understand this in a little more depth. I have imported all the necessary libraries, and you need to give your OpenAI API key here (if you create a free account they give you about five dollars of free credit, so you can use that; after the account is created you get the key, and we covered all of that in our LangChain crash course). Here I create an LLM object and then use UnstructuredURLLoader; we have already looked at this, so I won't go into detail. I'm just loading two different articles: the first is on Tesla ("Wall Street raises Tesla..." and so on) and the second is on Tata Motors, the India-based automotive company. We load both articles with our loader, then use the same RecursiveCharacterTextSplitter we saw before to create the individual chunks, 41 in total; you can inspect them, see the page_content of chunk 0, 1, 2, and so on.

Once that is done, you create the OpenAI embeddings: embeddings = OpenAIEmbeddings(), and then you use the FAISS class we imported and call its from_documents method. The from_documents method in FAISS accepts the documents, the chunks we created, and a second parameter which is your embedding; here I am using OpenAI embeddings, but you could use Hugging Face or any other embedding too. The result is this vector index. I'm not running this code live, by the way, because I already ran it and saved the vector index to a file; the way I saved it is with this code, where you take the vector index and write it to a pickle file called vector_index.pkl. So before shooting this tutorial I already ran this code and saved the vector index into a file.
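A hedged sketch of that indexing step end to end; the URLs are placeholders, the pickle file name follows the video, and OPENAI_API_KEY is assumed to be set in the environment (newer LangChain versions also offer FAISS.save_local as an alternative to pickling):

```python
# Load two articles, split them, embed with OpenAI, build a FAISS index, and save it.
import pickle
from langchain.document_loaders import UnstructuredURLLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

loader = UnstructuredURLLoader(urls=[
    "https://example.com/tesla-article",         # hypothetical URL
    "https://example.com/tata-motors-article",   # hypothetical URL
])
data = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, separators=["\n\n", "\n", ".", ","])
docs = splitter.split_documents(data)

embeddings = OpenAIEmbeddings()
vectorindex_openai = FAISS.from_documents(docs, embeddings)

with open("vector_index.pkl", "wb") as f:         # file name follows the video
    pickle.dump(vectorindex_openai, f)
```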
Let me show you that file on my disk: vector_index.pkl. This is sort of like a vector database, one that I have saved as a pickle file, and now I can load that pickle file while running this code. Once I run it, my vector index is loaded into memory, and it now has knowledge of both of these articles.

Now let's create the RetrievalQAWithSourcesChain object. The first argument it expects is the LLM, so I pass the llm we created earlier. The other argument is the retriever, which is basically how you plan to retrieve from that vector database: you give the vector index and call .as_retriever() on it; that's just the syntax we use. We call the result a chain; here I use the from_llm method, and you can see it has created the chain. Inside the chain you will see an interesting prompt, which might get you excited: "Use the following portion of a long document to see if any of the text is relevant to answer the question." This is exactly the thing we discussed before.

Now let's ask a sample question. My question is "what is the price of Tiago iCNG?", and if you look at the article, the Tiago iCNG price is between this and that, so that is the answer I expect from our code. I will enable debugging in LangChain so I can see what is going on underneath, and then I call the chain with the question as the argument. Let's run this code and see what happens. The debug output shows the internals: when I asked "what is the price of Tiago iCNG," it first retrieved the similar-looking chunks from my vector database, four chunks in total, one, two, three, four, and the question is the same for all of them. The first chunk says "the company also said it introduced the Tiago..."; my actual answer is sitting in that first chunk itself, but it still retrieves four similar-looking chunks: this one is also similar-looking but my answer is not there, and so are these. All this code is available on GitHub, by the way, so you can run it and go through it yourself.

That was step number one. Then it combines the question with each chunk and asks four individual questions of the LLM. My first prompt is "Use the following portion of a long document to see if any of the text is relevant to answer the question. Return any relevant text verbatim," followed by the first paragraph; similarly the second, third, and fourth questions. Those four questions go to the LLM, four LLM calls, and as a result you get four answers. Let's see them: the first answer has the Tiago iCNG price, so you know this will end up being the final answer, but it still produces all four; the second answer doesn't look that good but it still gives you something, then the third and the fourth. So FC1, FC2, FC3, FC4, these are the four answers it gave.

Now it combines those four answers into a summary and makes one more call to the LLM. These are the four answers, one, two, three, four, and here is the combined prompt for "what is the price of Tiago iCNG": the four answers are combined as "summaries," and the prompt is essentially "given the following summaries, give me the final answer." In the end it gives you the final answer: the Tiago iCNG price is between this and this, along with the source reference. So once again, it is using this map_reduce method.
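A compact sketch of the chain usage just walked through; it assumes the vector_index.pkl from the previous step exists, OPENAI_API_KEY is set, and the temperature/max_tokens values are illustrative:

```python
# Ask a question against the saved FAISS index using RetrievalQAWithSourcesChain (map_reduce under the hood).
import pickle
import langchain
from langchain.llms import OpenAI
from langchain.chains import RetrievalQAWithSourcesChain

with open("vector_index.pkl", "rb") as f:     # index saved in the previous step
    vectorindex_openai = pickle.load(f)

llm = OpenAI(temperature=0.9, max_tokens=500)  # illustrative settings

chain = RetrievalQAWithSourcesChain.from_llm(
    llm=llm,
    retriever=vectorindex_openai.as_retriever(),
)

langchain.debug = True   # print the intermediate per-chunk calls, as shown in the video
result = chain({"question": "what is the price of Tiago iCNG?"}, return_only_outputs=True)
print(result["answer"])
print(result.get("sources", ""))
```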
I have provided this notebook with a lot of comments, so you can read it and get a good idea. Overall, we have now looked at all the individual components: we started with the text loader, the splitter, FAISS, and we just covered retrieval. Next we are going to combine all these pieces and build our final project. It is not going to take much time because all the individual pieces are ready, and we have also understood the fundamentals behind the scenes. See, learning only the API is not enough; you need to understand how it works underneath, and only then can you become a great data scientist or NLP engineer. So far we have cleared all the fundamentals, our individual pieces are ready, we just need to assemble them. I'm super excited to move on to the next section.

Now we are going to use all the components we have built so far, assemble them, and build the entire project. I have a directory where I keep the project; in the notebooks folder you will see all the individual notebooks, and outside of it I will do the main project coding. Right now you see two files, requirements.txt and the .env file. requirements.txt lists all the libraries we are using, so you can run pip install -r requirements.txt to install them. In the .env file I have the OpenAI API key; you will put your own key here, which you got via the five-dollar free credit on OpenAI, or if you have a paid account just use that.

Now let me create a main.py file and import all the necessary libraries; since the list is pretty long, I'm just going to copy-paste it. The very first thing I do is load that OpenAI API key. So far we were setting os.environ directly, but that's a little clumsy; there is a better way, the python-dotenv module. You have to install it first, and then you just use two lines: they take all the environment variables from the .env file and load them into the environment. So if you look at the .env file, it will set, say, a domain variable, a root URL, and so on. It's just one call, and within that one call we have loaded our API key, and the key is not visible in the code, so it's a cleaner way and the standard practice nowadays.
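Those two lines, as a small sketch (assuming a .env file next to main.py containing a line like OPENAI_API_KEY=...):

```python
# Load secrets from .env into the process environment so the key never appears in code.
from dotenv import load_dotenv

load_dotenv()   # reads .env and exports its variables (e.g. OPENAI_API_KEY) into os.environ
```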
Now let me write some basic UI. I say st.title, and the title of the application is "News Research Tool"; then I create the sidebar. To show you the UI of our tool: on the left-hand side it will have three URL inputs (URL 1, URL 2, URL 3) with a Process button below them; the title will be visible at the top, and on the right-hand side there will be the actual question box with the answer below it. So in the sidebar I say st.sidebar.title("News Article URLs"), and I take three URLs: for i in range(3), st.sidebar.text_input, using a format string so the labels read "URL 1", "URL 2", "URL 3" (i+1 because i starts at zero). Below that there is a button, which I'll call "Process URLs", and its value goes into process_url_clicked; when the button is pressed, the flow goes into the if block.

Let's run what we have so far: the way to run it is streamlit run main.py, and it shows this kind of UI, three URLs and the "News Research Tool" title. Later we'll add the question box, but you can already enter the URLs, and when you hit the Process URLs button the flow goes into the if condition.

So let's write some code inside that if condition. Here we use UnstructuredURLLoader and pass it all the URLs. Where are my URLs? I build this urls list: whenever you enter a URL in the sidebar it goes into the list, the list is passed to the loader, and you call loader.load(). This should be very apparent to you by now: the first step is loading the data. After loading, the next step, as you all know, is splitting, and for that we use RecursiveCharacterTextSplitter, which takes separators and chunk_size (I'm not worrying much about chunk_overlap here, although you can play with it). That is my text splitter, and I tell it to split my documents; the documents go in and I get the individual chunks. Those chunks I then embed: embeddings = OpenAIEmbeddings(), then FAISS.from_documents with the documents as the first argument and the embeddings as the second, and I save the result into a FAISS index variable. Then we save this in-memory FAISS index to disk in pickle format; the file path can be any file name, I'm just going to call it faiss_store_openai.pkl.

Just to show progress while we are processing the URLs, below the "News Research Tool" title I want a progress indicator of sorts, saying "loading the data," "splitting the data," things like that, and for that we use a placeholder: main_placeholder = st.empty(). This creates an empty UI element that we can update.
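A condensed sketch of the processing side of main.py just described (widget labels, variable names, and the pickle file name follow the video where mentioned; treat them as illustrative):

```python
# Sidebar URL inputs, a Process button, and the load -> split -> embed -> save pipeline.
import pickle
import streamlit as st
from langchain.document_loaders import UnstructuredURLLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

st.title("News Research Tool")
st.sidebar.title("News Article URLs")

urls = [st.sidebar.text_input(f"URL {i + 1}") for i in range(3)]
process_url_clicked = st.sidebar.button("Process URLs")

file_path = "faiss_store_openai.pkl"
main_placeholder = st.empty()

if process_url_clicked:
    loader = UnstructuredURLLoader(urls=urls)
    main_placeholder.text("Data Loading...Started...")
    data = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n", ".", ","], chunk_size=1000
    )
    main_placeholder.text("Text Splitter...Started...")
    docs = text_splitter.split_documents(data)

    embeddings = OpenAIEmbeddings()
    vectorstore_openai = FAISS.from_documents(docs, embeddings)
    main_placeholder.text("Embedding Vector Building...Started...")

    with open(file_path, "wb") as f:
        pickle.dump(vectorstore_openai, f)
```

Run it with streamlit run main.py, enter the article URLs in the sidebar, and hit Process URLs to build and save the index.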
When we are loading the data we can call main_placeholder.text() and show "Data Loading...Started"; I'll copy-paste the same kind of progress message before the document splitting step and before the embedding step as well. So far the code looks good, so I rerun the app (click on Rerun or just press the R key). Now I give it these three articles: this is my first article, this is my second, and this is the third; all three are about Tata Motors. You can see it says "Data Loading...Started," so it is going to those three URLs with UnstructuredURLLoader, loading the data, splitting it into roughly thousand-token chunks, building embeddings with OpenAI API calls, and then using FAISS to build the index and save it to disk. When you look at the folder, you will find the file faiss_store_openai.pkl; this is like our vector database, so our vector database is ready.

Next we add the question box, so let's do that coding now. I use main_placeholder.text_input for the question, and whatever query comes in, I check with if query: so when you type a question and hit Enter, the flow comes here. What is the first thing to do? Load the vector database. But there could be a case where that file doesn't exist, so we first check if the file exists; if it does, we read it: with open(file_path, "rb") as f (it's a binary file), pickle.load(f), and we call the result vectorstore. We are using Python's pickle module, which you can see we imported. Then we create the RetrievalQAWithSourcesChain; it expects an LLM as input, and I don't think we created the LLM yet, so let me copy-paste that code to save recording time. I create the llm object, and now my chain looks good. Then you call the chain, supplying the question in the expected format along with the return-only-outputs argument set to True, and you get your result. The result is a dictionary with two elements: "answer", which contains the answer, and "sources", which contains the URL or URLs it came from. We need to display the answer, so I use st.header("Answer") and then st.subheader with result["answer"].

Let's try this much: I bring up the UI, click Rerun, and ask "what is the price of Tiago iCNG?" This is a question based on those three articles; if you look at the article, the price is 6.55 to 8.1 lakh, and you can see it gives the answer properly. I also want to see the source, the URL from which it retrieved the answer, and we can add that real quick.
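A hedged sketch of the question-answering side just described (including the sources display added next); it assumes the imports, file_path, and main_placeholder from the previous sketch, and the llm settings are illustrative:

```python
# Question box: load the pickled FAISS store, run the chain, and display answer + sources.
import os
import pickle
import streamlit as st
from langchain.llms import OpenAI
from langchain.chains import RetrievalQAWithSourcesChain

llm = OpenAI(temperature=0.9, max_tokens=500)  # illustrative settings

query = main_placeholder.text_input("Question: ")
if query:
    if os.path.exists(file_path):
        with open(file_path, "rb") as f:
            vectorstore = pickle.load(f)
        chain = RetrievalQAWithSourcesChain.from_llm(llm=llm, retriever=vectorstore.as_retriever())
        result = chain({"question": query}, return_only_outputs=True)
        # result looks like {"answer": "...", "sources": "url1\nurl2"}
        st.header("Answer")
        st.write(result["answer"])

        sources = result.get("sources", "")
        if sources:
            st.subheader("Sources:")
            for source in sources.split("\n"):
                st.write(source)
```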
Just to save recording time I won't go into too much detail, because this is very minor: you basically go to result and get the "sources" element from the dictionary. It might not be present, which is why I call .get() rather than indexing directly; if it is present, you create another subheader, "Sources:", and print the list of sources. Why a list? Because sometimes the answer may come from multiple URLs, so you need to handle that scenario. Let's bring this code in, rerun, and ask the question again: the Tiago iCNG price is this, and here is the URL, and if you open that URL it indeed contains the answer.

You can also ask summarization questions. For example, I have this article with KR Choksey's recommendation on a stock, and I will ask "can you please summarize this article." By the way, the answer is showing in a bold font, so let me change that; I'll use st.write instead, because that way it is not bold. You can see it has already summarized the article and also gave the source, now just in a smaller font. But overall, folks, this tool is ready. It is going to be very useful to my equity research analyst, my Peter Pandey who is investing on Rocky Bhai's behalf, because now you don't have to read so many articles: whatever question you have, you can ask this news research tool, and it will not only give you the answer but also the source references, which is very important.

And by the way, with the LLM boom many clients are building this kind of tool. How do I know? My own company, AtliQ Technologies, has some US-based clients for whom we are building these LLM projects, and this is a real-life use case I'm showing you, not a toy project; it is based on real things happening in the industry: document summarization, building a chatbot similar to ChatGPT on custom data. What we built is essentially a chatbot that can answer questions on custom data, which here is these three URLs. We just built the basic proof of concept and it is already working; the code and everything is given in the video description below, so please try it out.

Long term, as we discussed, when you build this project in industry you will build two components. One is the data ingestion system, where you write some kind of web scraper that goes through all these websites and retrieves the articles. For web scraping you can use native Python or Bright Data; even UnstructuredURLLoader will work, but it might stop working at some point because websites detect scraping activity and may block you, which is why people use tools like Bright Data, a proxy-network-based tool. Then you create embeddings, with OpenAI or Hugging Face, and store them in a vector database; if I'm doing this as a big industry project I will not use FAISS as the store, I will use a proper vector database (I can still use FAISS as a library), and once the data is in the vector database you can build the UI in React or whatever tool, call the vector database to retrieve the similar-looking chunks, and post the answer back to the chatbot. "Data, data, data, data delights me, I can't avoid it."

I hope you liked this video; if you did, please give it a thumbs up. Folks, we have done a lot of work; this is a full end-to-end, real industry project, it's not a toy project.
I would really appreciate it if you could share this with people and spread the word about the work we have done; all the code and everything is available to you at no cost. I hope you understood all of this, and this project is going to look pretty good on your data science or NLP resume. So please make sure you implement the entire project as per the guidelines, share it with people, and if you have any questions there is a comment box below. Thank you very much for watching. [Music]
Info
Channel: codebasics
Views: 31,992
Keywords: yt:cc=on, kgf, langchain tutorial, end to end project, stock market analysis, share market analysis project, llm, lang chain, langchain, streamlit, open ai, llm project, llm project ideas, nlp project, nlp project end to end
Id: MoqgmWV1fm8
Length: 74min 33sec (4473 seconds)
Published: Fri Sep 22 2023