GPT-4 Tutorial: How to Chat With Multiple PDF Files (~1000 pages of Tesla's 10-K Annual Reports)

Video Statistics and Information

Captions
Hi, it's Mayo from Chat with Data, and today we're going to talk about how you chat with multiple massive PDF documents across multiple files. In this case we're looking at Tesla's annual reports for 2022, 2021, and 2020, and each year's PDF is huge: the 2022 report is 449 pages, 2021 is around 300, and added up you're looking at about a thousand pages of PDFs. It just goes on and on, financial reports and all that kind of stuff, extremely tedious to go through and read. What we want is a situation like this, where it goes, "Warren Buffett here, what would you like to analyze?", and we can ask a question about a specific PDF. So: "What were the risk factors for Tesla in 2022?" I hit search and it gives us a response: the cyclical industry, inflationary pressures, increasing interest rates, the pandemic. What we want to do now is cross-check, because it gives us a reference, and it points to page 33.
Page 33, here, oops, let's see: "we are focused on growing manufacturing", demand, cells, energy generation, and let's see, is this about risk? Okay, so it covers the highlights, so more or less it's looking at the section with the management discussion and is able to assess the risk factors from there. But what if we don't just want to talk to one PDF? Talking to one PDF is already insane, that was a 500-page PDF document, but what if we wanted to talk to three of them simultaneously, covering the past three years, and ask a question that analyzes all three years? "How has management performed over the past three years?" Let's see. Now this is where it gets interesting: it picks up on the three years of annual reports, and it notes that more context is required to provide a more accurate answer, fair enough, and it manages to pick out sources for each of these years as well. Now what if you want to ask a more technical question? So we say something like "How has Tesla's gross margin changed over the past three years?", and let's see. It figures out it's looking at these three reports, an increase, okay, interesting, nice, and it provides references. So let's check the 2022 reference, page 39, let's hop over; they usually cover more than one page, but let's see. Nice, so we've got a comparison of the revenues, and it's able to pick up these things here as well; I think it mentioned some of these numbers pretty accurately too, so that's pretty cool. So that's what we're talking about here: how do we talk across not just one or two, but three or more PDF files to get insights? So how does this work? Let's jump into the diagram. Actually, before I do that, if this is your first time seeing the kind of structure I'm about to show you, it's going to be overwhelming, so I advise you to watch the previous video that covered a simple PDF chatbot, and you
can go to the repository here, and the link will be below as well. This is the starting point; from here I customized the design for this demo I'm showing. The demo is very buggy, so I'm probably not going to release the code anytime soon, because I'm testing it as I'm recording this video. So here is the multiple-PDF-chatbot architecture. Now, if you've seen the previous one in the repo, or the previous video, you're probably like, what is going on here? Just bear with me, I'm going to go through this slowly and make sense of what's going on. First of all, we have the PDF docs, each PDF that you have. In this case I have, say, the 2021 Tesla annual report and the 2022 Tesla annual report. I convert each of them to text, because PDF is binary, so you need it in text. Once we have the text, we split each of them into chunks, because OpenAI has a context window that only allows so much. We pass these chunks to OpenAI to create embeddings, which are just number representations. This phase we'll call ingestion: the process of converting your docs into text and ultimately into number representations that computers can understand, that you can store somewhere and search quickly. You can see that 2022 got stored as these numbers here in this thing called a vector store, and 2021 the same. And what is a vector store? You can think of it as a database of some sort that houses the number representations of your documents in different categories or different spaces, and let's call these spaces namespaces. A namespace you can think of as just a little box in your house: that box contains specific things. It could be a box that contains all your clothes, or a box that contains all your shoes; the point is it contains similar things. So in this case this box, this namespace in this database, contains the number
representations of your PDF docs, but also the text of your PDF docs and any other quote-unquote metadata or information that's important and related to your documents, and the same thing for 2022. In this case we're looking at Pinecone as the vector store; the embeddings model is Ada, we'll be using GPT-4 for chat, and LangChain ties it together; we'll jump into that in a second. So this is phase one, the ingestion. Now here we go, just follow slowly: "What was Tesla's gross profit margin in 2022?" The typical way would be: you ask the question, we combine the question with the chat history, GPT creates a standalone question from those two, then we convert that question to numbers, go into the vector store, specify the specific namespace or box we want to retrieve from, and get the relevant docs. We combine the standalone question with the relevant docs, and GPT-4 takes that context plus the question as the prompt and responds. That's, in a nutshell, how we've looked at things so far. But when you're dealing with multiple PDF files, you could technically query an individual namespace, which we did, where you specify the namespace and say, I want information relevant to 2021, or I want information relevant to 2022. But what happens if you want to analyze things across multiple years, or multiple namespaces? You want to be able to ask a question that is related to 2020, 2021, and 2022,
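The phase-one chunking step described above (split each document's text into overlapping pieces before embedding) can be sketched as a plain function. This is a simplified stand-in for LangChain's text splitter, not the code from the video:

```typescript
// Simplified character-based text splitter: cut `text` into chunks of
// `size` characters, with `overlap` characters shared between consecutive
// chunks so a sentence straddling a boundary still appears whole in one chunk.
function chunkText(text: string, size = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // final chunk reached
    start += size - overlap; // step forward, keeping the overlap
  }
  return chunks;
}
```

Each chunk would then be sent to the embeddings endpoint and stored in the vector store under its year's namespace, alongside metadata such as the page number and source file.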
and I want to analyze information that spans all of them, so in that case we need a new strategy. "What was Tesla's gross profit margin in 2022?" is a straightforward question; that's the base case here, let me just highlight it for a second. But what if you ask, "What was Tesla's gross profit margin in 2021 and 2022?" Now you need a way to search both 2021 and 2022, by first extracting the namespaces from the question somehow. So we need GPT-4 to help us extract the namespaces from the question. In this case, if the namespace is called 2021, we ultimately don't want to hard-code this; we want it to be dynamic. Does that make sense? I'll go over the code and walk through it slowly too, but you want the model to figure out: what year is the user referring to, and what namespace is it associated with, whatever names you've decided to give them. From there we have this dynamic relationship where we don't hard-code the namespace; we've got the namespaces down here. So when there's a question, we convert the question to embeddings, but now we also have this dynamic context specifying the namespaces to look at. We check the namespaces, 2021 and 2022, retrieve the relevant docs for each namespace, and then we're back to the usual procedure. That's why, if I go back to the demo, it's able to come back with something like this when I specify that I'm looking over the past three years. So, for example, this is a website called Seeking Alpha, used by a lot of investors, and here it's showing revenue year-over-year. So let me ask: "What is Tesla's estimated revenue growth year-over-year?" I think that should be sufficient. We ask the question, and let's see what happens. Now it's asking me to specify, so I'm going to specify: "What is Tesla's estimated revenue growth year-over-year for 2022?"
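The dynamic year-to-namespace lookup described above can be sketched roughly like this. The regex approach and the year-to-namespace map mirror what the video describes, but the function names and exact namespace strings are my own illustration:

```typescript
// Hypothetical sketch: extract years from a model response like
// "Searching 2021 and 2022 annual reports..." and map them to the
// Pinecone namespaces created during ingestion.
const YEAR_TO_NAMESPACE: Record<string, string> = {
  "2020": "tesla-2020", // namespace names are illustrative
  "2021": "tesla-2021",
  "2022": "tesla-2022",
};

function extractYears(answer: string): string[] {
  // Grab every four-digit year like 20xx, de-duplicated in order.
  const matches = answer.match(/\b20\d{2}\b/g) ?? [];
  return [...new Set(matches)];
}

function namespacesFor(answer: string): string[] {
  return extractYears(answer)
    .map((year) => YEAR_TO_NAMESPACE[year])
    .filter((ns): ns is string => ns !== undefined);
}
```

A call like `namespacesFor("Searching 2021 and 2022 annual reports")` would return the two matching namespaces, which are then handed to the retrieval step; years with no matching namespace are simply dropped.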
And now that we've specified, it's going to search 2022. Here we go, let's see: it figures it out, doing the calculations based on the consolidated statement of operations, dividing the numbers, and getting 51.33. Let's see what Seeking Alpha says. I have no idea... wow, that's crazy. Okay, I did not expect that to happen, but okay, interesting. And this is trailing as well, so this is accurate. I just got a bit excited there, that's insane. Let's look at the code now. I'll jump back; I hope that makes sense, what just happened: it figured out the years, attached them to the namespaces, and so when the question was asked, it knew to draw the relevant docs from the two namespaces. Okay, so let's jump to the code. Now, there's a lot going on here, so let me start from the beginning. You can see in the reports folder I have each Tesla PDF that I showed earlier, and I have a function, well, a script, that I'm running called ingest-data. What ingest-data does is go into the reports folder right here, use a directory loader from LangChain to read the file paths, and as it goes through the file paths it looks for PDFs and loads each PDF into text. I can show you what that looks like. Let me just cross-check, I've got other things going on; let me show you what the output looks like anyway. So to run this script, I'd have done an npx tsx with the config, and then the scripts here, ingest-data.ts. I'll cover more of this in the upcoming workshop, but effectively what this does is: it loads this page, it checks the environment variables that contain the API keys, and it compiles the TypeScript to JavaScript on the fly without having to emit the files. So if I were to run this, it
would run, but I don't want to do that because I've already done the ingestion and it took a couple of minutes. Once the process was done, let me just clear that before I get into trouble, this is what it looks like: you can see I create a file, test.json. This was the first phase; you can see there's 2020, and then you also have 2021 and 2022, and this is all of Tesla's files in here. The way I set things up, I have references for each of them, and that's why you saw the page numbers: every page has a page number and also a reference to the original source, which is what you saw in the UI as well. And in the ingest-data script I also split them; I basically created these dynamic namespaces. The directory loader, as I showed in the diagram, goes through each folder, and then you create a grouping of all the different folders in a map, a mapping of each year to its documents. Then, no, sorry, at this point I've only initialized the splitter, I haven't split anything yet, and then I prepare Pinecone. So what's going on here, and again I'm trying not to get into the details and stay high-level: once we have translated the PDFs into text and grouped them into different categories, 2020, 2021, 2022, now we need to split them into chunks, as we spoke about. Here I'm splitting into chunks of one thousand characters with two hundred of overlap, and for each group I'm assigning a namespace, as we spoke about, called Tesla plus the year. What I did later on, once I finished phase one, was create namespace entries for the years in my config: Tesla 2020, 2021, 2022, and these are all associated with namespaces in my Pinecone account, which I'm going to show you in a second. So now we have, say, Tesla 2020, and we split the Tesla 2020 pages into chunks of a
thousand characters each. And what does that look like? Let me show you. You can see all of this is just 2020, literally all 2020, and each chunk has the page number associated with it: this is page five, page five, page five, page six. Again, the data structure makes it easy to do searches and manipulation down the line. I did the same for 2021 and 2022 as well. Then we have to begin putting these into the namespaces; we're using Pinecone at this point, and we're trying to insert, upsert, into the namespace. But Pinecone has limits: you cannot just do it all in one go, you have to do it in batches, and the recommendation is around 100 vectors. For context, remember your chunks are converted into embeddings; these embeddings are numbers, so you can call them vectors, and these vectors contain the number representations of your pages as well as your text and metadata. So we need to split into batches, batches of 50 in this case, and each batch is then inserted into the Pinecone database, all using LangChain functions to make this process easier to run. Once this is all done, this is what it looks like. Let me hop back and show you: this is my Pinecone dashboard. Once you sign up, you have all these variables here. Now, in the previous video there were a lot of complaints about upserting and inserting. The thing with Pinecone is you have to be very specific. For example, your environment is roughly where your pod is closest to, and by default, if you're on a free plan, they will give you one; this has to match the environment variable in your code. Cosine is the calculation used for similarity. And dimensions is the number of dimensions OpenAI's embeddings have; a dimension you can think of as one particular slot in an array of values, so if you have 0.1, 1.1, 1.2 representing your text, that's three dimensions. When OpenAI does embeddings, it produces 1536, so you have to specify
1536; otherwise the insertion is not going to work. Make sure your API keys are correct as well, and that your index name matches the index name in the configuration of the code. Again, I'm speaking to everyone who had trouble with this repository, and I just want to thank you all for the support, because it's been trending on GitHub for a couple of days, so I appreciate that. Now, where was I? Okay, so these are the indexes and the index info, and these are the namespaces. If I jump in, you can see Tesla 2020 and the number of vectors for each namespace. Remember, in the previous video there was just one; now we have three, and now we want to communicate across all three of them. I hope that at least explains what's going on here. As I showed in the previous video, you can just come here and play around. If you don't know what a vector looks like, it's pretty much a case of: if I click fetch, this is what a vector looks like. You've got your namespace, which is Tesla in my case, you can see the ID, and then you have the values, which as I said are just floating-point number representations of your text, because if I open it here, you can see the text; the text is here, and these are the 1536 dimensions I spoke about. So let's go back to the code. Once this ingestion process is done, again, for anyone having trouble with this from the previous video, you just want to make sure that the namespaces in your config here match what's in your dashboard, and that you've set your environment variables, whatever they were. Other than that, don't tamper with the versions either, because if you upgrade LangChain, there are some breaking changes, so make sure you clone it exactly as it is and follow the instructions in the readme, which I put there, and there's also the video you can check out as well. But anyway, back to this particular video: so once the
ingestion is done, now we want to go to the next phase. And what's the next phase? Well, let me go back to the diagram just in case you lost track. We're done with this process, so now we move to phase two: we want to chat, and we want to be able to specify, let me drag that down, what exactly, or rather which namespace, we want to retrieve information from, and we want it to be dynamic. So if I hop back, I have a script for it. This is from when I was experimenting with how to go about this, and the way you want to think about it, again, these are all functions from LangChain, a bit customized, but let's think about this at a high level. First you have your OpenAI instance; this is LangChain's chat OpenAI wrapper, which basically has the same functionality as the OpenAI API, except it has extra features like caching and memory and so on. At a high level, we have the reports prompt, some prompts that I wrote very quickly for this, and don't worry about the syntax; this is all LangChain trying to make life easier. Effectively, we're trying to combine human prompts with the system prompt and the history, and we want to be able to have a dialogue by calling the chain. LangChain has these things called chains, which are basically a sequence of prompts with an LLM, a call to GPT, similar to the diagram we have. So don't be too intimidated by this stuff; I'll cover it in the upcoming workshop, and you can also look at the LangChain docs. Just focus on the high level and not too much on the code per se. And this is a function to extract the years from the query: basically, the model responds with something that confirms the years, it says something like this, or it could come back and say, like you saw in the UI, searching
for whatever years. Based on that, you can extract the years using a regex, and once you have the numbers, you map over them and get the namespaces, because remember, we need to match each number to a namespace. In my case, this is the way I set up the data structure; it could be whatever you want. So if the AI says this question is related to 2020 and 2021, the function returns the two namespaces that already exist. I hope that makes sense. Then you have this custom QA chain, and what it does, effectively, is take the model, the index, and the namespaces. The model is your chat OpenAI instance, the index is the Pinecone index via LangChain, and the namespaces are whatever got pulled out. Basically, what this chain does is go to Pinecone and set the stage for the phase you saw in the diagram, where we have the standalone question and are able to go with that standalone question to the specific namespaces, retrieve the relevant docs, combine them, and get a response. This is all happening under the hood, and this is a custom QA chain that I made; LangChain has two others, called ChatVectorDBQAChain and VectorDBQAChain. Again, don't get intimidated; if you focus on the diagram, you realize it's not about the code, you can do it other ways, this is just one way. So I set the chat history to nothing, then we make the call and stringify, and we're good to go. Let me show you what that looks like. If I run, oops, okay, let's see, main.ts, I'm wondering if it's going to complain about that... no, cool, let's see. What I want to see is: there was a question asked, I think, at the very top, and we should see GPT-4 trying to figure out what year you're referencing, extracting the namespace, and then we pass that namespace to
the chain, and there we go. That's beautiful, that's awesome, that's exactly what I was looking for: searching 2021 and 2022, we extract the years, then we map the years to the namespaces in the vector database, we search the namespaces for results, and this is what we retrieved about risk factors, including the source documents. So I hope this demystifies what the UI was doing by itself. And basically, yeah, there's a front end; the front end is already in the repo, the GPT-4 PDF repo, so you can go use that. This is really just an adaptation; I'm doing a ton of prompt engineering and just experimenting with talking across different PDFs. Then the chat endpoint basically uses the same logic as the main.ts you saw, but through the API; it's similar logic, I just wanted to test it here first, and once I saw this was decent, I moved it over to the API. All right, so that's it, basically, in a nutshell. I haven't tested it too much, but so far I'm able to read three different files with close to a thousand pages of in-depth financial analysis, and GPT-4 is able to provide analysis across all three years. That's decent. So let's jump back to the demo and have a bit more fun and see what's possible. I'm going to use these guys because I think they have trailing... oh, they have year-over-year, three years, huh, okay, fingers crossed, let's see. Let me pick one, let me pick revenue: compound annual growth rate for revenue over the past three years. Okay, let's see what happens; I doubt this is going to work. "What was the compound annual growth rate of Tesla's revenue over the past three years?" If this doesn't work... I expect this not to work, all right, but okay, here we go, one, two, three. It's going to look at all three reports and think through it. Okay, okay... ah, okay, not bad. I mean,
fifty percent, sixty percent, and it's referencing the actual files themselves. Wow, so this is page 49, that's interesting. Let's ask one more question for the road. We've got profitability... what are the kinds of questions... we've talked about risk factors and management, so let's talk about growth: "Based on the past three years' annual reports, what is the growth potential of Tesla?" Let's just stop there, Warren Buffett section, let's see. Wow, this is incredible, I don't know how... so it's talking about expanding production capacity, and for context, I've only set this to a topK of 1, which is just the number of reference documents to return per namespace and use as context; that's all I'm doing for now. I could actually increase the number of returned source documents used as context, and I'm pretty sure the accuracy would jump. So this is a very rough sketch. If you're asking for the code, just go over here and tweak and play around with it, but this is the architecture that you can use for your own application. Like I said, it's just a case of thinking through this step by step and using GPT to help you along the way. I'll also make some future changes to this repository, so I'll add features for multiple PDF files here at some point in the future, but in the meantime, if you have any questions, just shoot me a message on Twitter or leave me a message on YouTube. I understand this is a bit complicated, but just re-watch the video and I'm sure it'll make sense. Thanks for watching, and if you want more in-depth, step-by-step details on this stuff, I'm going to be doing a series of workshops soon, so sign up for the waitlist in the description section as well. So that's it, cheers.
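The batched upsert discussed in the ingestion section (Pinecone recommends keeping each upsert call to roughly 100 vectors; the video uses batches of 50) can be sketched like this. The vector record shape and the `upsertBatch` callback are illustrative assumptions, not Pinecone's actual client API:

```typescript
// Illustrative vector record: id, 1536-dimensional embedding, and the
// metadata the video stores alongside it (text, page number, source file).
interface VectorRecord {
  id: string;
  values: number[]; // 1536 dims for text-embedding-ada-002
  metadata: { text: string; page: number; source: string };
}

// Split an array into batches of at most `size` items.
function toBatches<T>(items: T[], size = 50): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical driver: upsert each batch sequentially into one namespace.
// `upsertBatch` would wrap whatever vector-store client you use.
async function upsertAll(
  vectors: VectorRecord[],
  upsertBatch: (batch: VectorRecord[]) => Promise<void>,
): Promise<number> {
  const batches = toBatches(vectors, 50);
  for (const batch of batches) {
    await upsertBatch(batch); // one network call per batch
  }
  return batches.length; // number of upsert calls made
}
```

Keeping the batching logic separate from the client call makes the per-request limit easy to adjust if the vector store's recommendation changes.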
Info
Channel: Chat with data
Views: 251,637
Keywords: gpt3, langchain, openai, machine learning, artificial intelligence, natural language processing, nlp, typescript, semantic search, similarity search, gpt-3, gpt4, openai gpt3, openai gpt3 tutorial, openai embeddings, openai api, text-embedding-ada-002, new gpt3, openai sematic search, gpt 3 semantic search, chatbot, langchainchatgpt, langchainchatbot, openai question answering
Id: Ix9WIZpArm0
Length: 43min 6sec (2586 seconds)
Published: Mon Mar 27 2023