Chat with Multiple PDFs using Llama 2 and LangChain (Use Private LLM & Free Embeddings for QA)

Captions
In this video you're going to learn how you can chat with multiple PDFs using Llama 2 and LangChain. How can you use Llama 2 to chat with multiple PDF files? We're going to load the Llama 2 model on a single GPU, we're going to use some free embeddings in order to analyze earnings reports from Meta, Tesla, and Nvidia, we're going to load those multiple PDFs within our project, and we're going to use LangChain in order to create a chain that you can ask questions about the PDFs. Let's get started.

There is a complete text tutorial that is available for MLExpert Pro subscribers. It is within the prompt engineering section, "Chat with Multiple PDFs", right under projects. Here you can find the complete text tutorial along with explanations of Llama 2 and GPTQ, the complete source code, and visualizations of some of the data and the responses from the chain. If you want to support my work, please consider subscribing to the MLExpert platform. Thanks.

The documents that we are going to use are earnings reports from three different companies, and these are the latest filings. First we have Tesla; this is the latest earnings report, and you can see that it contains a lot of tables and a lot of text. It is about 40 pages long, and we're going to use this one with our chatbot. Next we have Nvidia, again the latest filing, with lots of text, table data, etc., and we're going to ask questions about all of this data. The final one is Meta's earnings report. Here the format is pretty similar: we have some tables, a lot of formatted text, and a lot of numbers, so we are going to have a look at how Llama 2 handles these numbers.

The model that we are going to use is a quantized version of the Llama 2 13-billion-parameter model, and it actually fits within a T4 GPU, which is why I'm going to use it. This one is provided by TheBloke, and it's essentially an AutoGPTQ format of the original Llama 2 model. In our case we're going to use the 4-bit format with a group size of 32, and the model size is going to be roughly 8 gigabytes, so we're going to fit that into VRAM. Here you can see that it gives us the highest possible inference quality with maximum VRAM usage, but the speed of the inference is somewhat slower. In our case this works fine, but if you want faster inference and are willing to trade away some inference quality, you might try some of the other variants.

The model uses the AutoGPTQ library. If you're not familiar with it, it is a great library that you can use to apply the GPTQ algorithm, which is essentially a smart way to quantize GPT-like models. The library provides a way to quantize models and then do inference with the already quantized models, and here you can find some of the benchmarks: it gives you better inference speed and less VRAM usage. Of course there is a hit on the quality, but that hit is actually quite negligible, at least according to most of the benchmarks that I've seen.

I have a Google Colab notebook that is already running, and at the start you can see that I'm using a Tesla T4 GPU, so this can be used within the free tier of Colab. In this Colab we are starting with zero megabytes of VRAM usage out of the maximum available. We can also check the CUDA version, and this is important because we want to install the matching build of AutoGPTQ. So first I'm installing a bunch of dependencies. We are installing torch, transformers, langchain (the latest version as of now), chromadb, pypdf, then xformers, which will probably speed up our inference, then sentence-transformers and InstructorEmbedding, since we're going to use those within the embeddings that we're going to build, and then pdf2image, since I'm going to show you some of the PDF pages right within the notebook. Next I'm installing the latest version of AutoGPTQ, the build for CUDA 11.8, since this is the CUDA version that is compatible with this Google Colab notebook. Finally I'm installing poppler-utils, so I can work with pdf2image right here. The install cells look roughly like the sketch below.
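A minimal sketch of the install cells, under the assumption that the plain PyPI auto-gptq package resolves to a CUDA 11.8 build (the video installs a wheel built specifically for CUDA 11.8); exact version pins are omitted. In Colab, prefix each line with an exclamation mark:

```bash
# Core dependencies for the project
pip install -qqq torch transformers langchain chromadb pypdf xformers sentence_transformers InstructorEmbedding pdf2image
# AutoGPTQ; the video uses a build matching the notebook's CUDA 11.8 runtime
pip install -qqq auto-gptq
# poppler is needed by pdf2image to render PDF pages
apt-get -qq install poppler-utils
```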
Next I'm doing all the imports that we are going to need. One of the more important parts is AutoGPTQForCausalLM, for causal language modeling; this will load our model. Then we are using PyPDFDirectoryLoader, since we are going to work with the three different PDFs. I have those right within here, essentially in a folder, and you can see the PDFs right here. Those PDFs are available from my Google Drive, and I'm downloading them like this, right within this directory. These are the example pages that I'm going to show you, essentially the first page of each report that we went through.

Then I'm loading all of the PDFs with the directory loader, and you can see that we have exactly 100 pages across all the documents. Next I'm loading the Instructor embeddings. These embeddings, if you don't know, are available on the Hugging Face Hub, and there is the Massive Text Embedding Benchmark (MTEB), a leaderboard for embeddings that are free and open source. Here you can see that the Instructor large embeddings are currently in eighth place; they are very popular and they perform very well. Of course, there are now much newer embeddings that perform even better, and I might try some of those in the next couple of videos so we can compare their performance, since embeddings are really important when we are doing retrieval-augmented tasks such as the one we're doing today. So have a look at different embeddings, try running with some of those, and let me know down in the comments if you get much better results compared to what we are going to get here. These are the embeddings we're going to use: the instructor-large model. If you have, let's say, Colab Pro, you might try a higher-RAM instance, and then you might even go for the instructor-xl model; it might fit into memory as well, but you'll have to try. The loading, embedding, and indexing steps look roughly like the sketch below.
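Here is a sketch of the loading, embedding, splitting, and indexing steps, as described here and in the next part of the walkthrough. The "pdfs" directory name and the chunk overlap of 64 characters are assumptions, since the video doesn't spell them out:

```python
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# Load every page of every PDF in the folder ("pdfs" is an assumed path)
loader = PyPDFDirectoryLoader("pdfs")
docs = loader.load()
print(len(docs))  # 100 pages across the three reports

# instructor-large embeddings, placed on the GPU
embeddings = HuggingFaceInstructEmbeddings(
    model_name="hkunlp/instructor-large",
    model_kwargs={"device": "cuda"},
)

# Split into 1,024-character chunks; the overlap of 64 is an assumption
splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
texts = splitter.split_documents(docs)  # 355 chunks

# Embed everything into a persistent Chroma database (this step is slow)
db = Chroma.from_documents(texts, embeddings, persist_directory="db")
```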
Next I'm splitting the documents into chunks of 1,024 characters with some overlap, so we have 355 chunks of text in total. Then I'm embedding everything, using the embedding model, into a Chroma database. You can see that this takes quite a lot of time, because this model is not really that fast, even though we are putting it on a CUDA device. After that, the database is ready to query.

What we do next is actually download the model and the tokenizer. I started with the 32 group-size model, but I couldn't fit it within this Google Colab instance; previously I told you that we were going to try that, but it didn't fit into the VRAM, so I'm using the 128 group size. You can see that I'm using the same repository from TheBloke, the Llama 2 13B chat GPTQ, and we're using the 4-bit model. This is very similar to how you would work with a model through the Transformers library, but this time we are passing in the correct quantized model, and we're using the safetensors file from the repository. Then, and this is really important, I don't want to inject fused attention; I'm doing this because the AutoGPTQ library doesn't work when you're passing in a Llama 2 instance here otherwise. And then the device: of course, we're going to run this on a CUDA device. The model took about 7.3 gigabytes of storage to download. So this is the tokenizer and the model loaded, and after we have both models, the instructor-large and this Llama 2 model, you can see that we are using roughly 10 gigabytes of VRAM, so we are fitting within the VRAM.

Then I'm going to recreate the default system prompt. This is taken originally from the repository for the Llama model, and I'm using this function, generate_prompt, to which I'm going to pass the system prompt; in this case it's going to be the default one. I'm using the format expected by these Llama 2 chat models: you can see the [INST] tag, then the system prompt within <<SYS>> tags, and then the actual prompt at the end.

In order to get the responses to stream, we're going to use the TextStreamer for the responses from the model. Then I'm going to create a pipeline, which is actually a Transformers pipeline for text generation, passing in the model and the tokenizer. I want 1,024 maximum new tokens, and I want the temperature to be zero so I can reproduce the results, and we have some repetition penalty and top-p sampling for the tokens. These are pretty much the default settings that I found for Llama 2, but the temperature is set to zero since I want this to be repeatable. Then I'm passing in the streamer. So we have this pipeline now, and I'm going to wrap it within the LangChain HuggingFacePipeline. I'm just passing in the text pipeline, and then again I'm passing in temperature equal to zero, just in case the first setting didn't work as expected; maybe this is redundant, you'll have to try that on your own. Put together, the model and pipeline setup looks roughly like the sketch below.
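A sketch of the model and pipeline setup under a few assumptions: the model_basename of the 4-bit, 128 group-size safetensors file follows TheBloke's usual naming of that period, and the top-p and repetition-penalty values are typical defaults rather than values confirmed on screen:

```python
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer, TextStreamer, pipeline
from langchain.llms import HuggingFacePipeline

MODEL_ID = "TheBloke/Llama-2-13B-chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    MODEL_ID,
    model_basename="gptq_model-4bit-128g",  # assumed filename of the 4-bit, 128 group-size weights
    use_safetensors=True,
    inject_fused_attention=False,  # fused attention breaks with this Llama 2 model in AutoGPTQ
    device="cuda:0",
)

# Stream tokens to stdout as they are generated
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1024,
    temperature=0,            # deterministic output so the results are reproducible
    top_p=0.95,               # assumed default
    repetition_penalty=1.15,  # assumed default
    streamer=streamer,
)

# Wrap for LangChain; passing temperature again is possibly redundant
llm = HuggingFacePipeline(pipeline=text_pipeline, model_kwargs={"temperature": 0})
```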
The first thing that I'm doing after that is to change the system prompt, since I want something much more specific to our case: "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer." A pretty direct prompt; I'm not actually telling it that I want it to be an expert on earnings reports or financial statements, etc. Maybe that would be a better system prompt than what I have, since this is pretty generic. And this is the format: I'm passing in the context and then the question itself, a pretty standard prompt, and then I'm passing in the system prompt as well. For the LangChain library I'm going to use the PromptTemplate, to which I'm going to pass this template, and then I'm going to say that the input variables, the variables that we're going to pass in, are the context and the question.

Finally, I'm going to put everything together into a RetrievalQA chain. In this case I'm passing in the LLM, which is the AutoGPTQ model, I want to stuff everything together, and then I'm going to get a retriever from the database. Here I'm passing in k equal to 2, and this is because the AutoGPTQ library currently has a bug where you can't pass in the full prompt length; you can only pass 2,048 tokens. Once this bug is fixed, I'm going to update the tutorial on MLExpert for MLExpert Pro subscribers and pass in the largest possible number; maybe I will increase this to 4 and try to work with that. But again, this is a current limitation caused by the bug that I've already told you about. I also want this to return the source documents, so we can see where the answers are coming from, and then I'm passing in the prompt as a keyword argument. This is the prompt that we're going to run with, and for this demo I also want the chain to be verbose when I run it. The whole thing looks roughly like the sketch below.
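The prompt template and the RetrievalQA chain just described might look like this; the [INST]/<<SYS>> tags are the documented Llama 2 chat format, and the rest follows the walkthrough, with the first question from the demo as a usage example:

```python
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

SYSTEM_PROMPT = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer."
)

def generate_prompt(prompt: str, system_prompt: str) -> str:
    # Llama 2 chat format: [INST] <<SYS>> system <</SYS>> user [/INST]
    return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{prompt} [/INST]"

template = generate_prompt("{context}\n\nQuestion: {question}", SYSTEM_PROMPT)
prompt = PromptTemplate(template=template, input_variables=["context", "question"])

# llm and db come from the earlier sketches
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # stuff all retrieved chunks into a single prompt
    retriever=db.as_retriever(search_kwargs={"k": 2}),  # k=2 due to the 2,048-token limit
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt},
    verbose=True,
)

result = qa_chain("What is the per share revenue for Meta during 2023?")
print(result["result"])
for doc in result["source_documents"]:
    print(doc.metadata)
```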
Let's try it. The first question that I'm going to ask is: "What is the per share revenue for Meta during 2023?" Here is the Meta report, and if you go through it, this number is actually available; under earnings per share, you can see that the basic value for 2023, at least for the three months ended June 30th, is 3.03, and the diluted value is 2.98. So if I run this, let's see what we get. Okay, this is pretty much the answer, and it actually picked up the diluted value, I believe; yeah, 2.98, so pretty straightforward. Let's have a look at the documents it used: it got two documents, and this is pretty much exactly it, diluted earnings per share of 2.98. Ah yeah, here it is: this table actually has the number it took it from, and this might be another place where it found the 2.98. Okay, so it found the answer for this one.

But if we ask about Tesla, it says the financial statements don't provide such an answer, which I believe is wrong, because if you go to the Tesla disclosure and search for "per share", you have the basic and diluted per-share values for 2023, ending at June 30th. So we have essentially the same table, the same numbers, the same information right within the Tesla report, but the model didn't pick it up. What happened here is that the embeddings didn't correctly identify the information, so essentially we gave the model context that didn't contain the answer. In this case I shouldn't, or couldn't, say that Llama 2 is failing; rather, our embeddings probably didn't do a good job (a quick way to check what the retriever returns is sketched below, after this walkthrough). I'm asking the same thing about Nvidia, and it says that during 2023 it was 0.83. Let's see where it got this: yeah, net income per share, we have a table right within the document, and here you can see that the basic per-share value, though not the diluted one, is 0.83. So we have this information from here. Okay, so all of the documents had this information, but for Tesla the embeddings failed us, while the model itself is doing a good job here.

Then we have: "What is the estimated year-over-year revenue for Meta during 2023 expected to be?" It answers that this is expected to be between 32 and 34.5 billion, which represents an increase of 11 to 16 percent, and I believe this is correct based on the documents. For Tesla it does some strange calculations, in which it takes the first half of 2023 revenue times two. I'm not sure why it is doing that; if you know whether there is something about estimating this type of revenue that involves the first half of a year times two, let me know, but I believe this is completely wrong, so I think it didn't do a good job here. For Nvidia, again, it didn't give us good information; I'm not sure if the document even contained this, but it didn't work for Nvidia either.

Then I'm asking which company is more profitable during 2023, and it says that it can't give us an answer, because the year is not complete, so it can't pick one. Then I'm asking the final question, which is to choose one company to invest in to maximize your profits for the long run. It starts, let's say, the ChatGPT way, saying it is not a financial advisor, etc. (of course, this is not a financial advice video either), and that it is not appropriate for it to give financial advice or make recommendations on specific investments. But then it continues with something a lot more concrete: based on the information provided in the context, Nvidia appears to be a strong company with a diverse range of businesses and a solid track record of innovation and growth. Nvidia has a long history of developing cutting-edge technology, including GPU architectures, and has successfully expanded into new markets such as artificial intelligence, data science, and autonomous vehicles. The company has also demonstrated a commitment to sustainability and social responsibility, which might be attractive to socially conscious investors, if that is important to you. Okay, so it thinks that Nvidia is going to be the best company to invest in for the long run, and you will see if this is the case when the future comes.
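One way to diagnose a retrieval miss like the Tesla case above is to query the vector store directly and inspect which chunks come back; if the earnings-per-share table isn't among the top k, the model never sees it. A minimal sketch:

```python
# Ask the Chroma store (db, from the earlier sketch) what it would hand to the model
docs = db.similarity_search("What is the per share revenue for Tesla during 2023?", k=2)
for doc in docs:
    print(doc.metadata.get("source"), doc.metadata.get("page"))
    print(doc.page_content[:200], "\n---")
```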
This is it for this video. We've seen how you can get an open large language model and use some free embedding models in order to analyze and chat with PDF files. In our case we took Tesla's, Nvidia's, and Meta's earnings reports, the latest filings for each, and started asking questions about those PDFs. The PDFs were pretty complex, and some of the answers were great, while some of the others were not that good, but you can experiment and see whether this works better for you. Thanks for watching, guys! Please like, share, and subscribe. Also join the Discord channel that I'm going to link down in the description below, and if you want to support my work, please consider subscribing to MLExpert Pro and reading the complete text tutorial that is available there. I'll see you in the next one. Bye!
Info
Channel: Venelin Valkov
Views: 10,185
Keywords: Machine Learning, Artificial Intelligence, Data Science, Deep Learning
Id: QshWJ77Sktc
Length: 21min 54sec (1314 seconds)
Published: Thu Aug 17 2023