How to Compare Multiple Large PDF Files Using AI (w/ Jerry Liu, Co-Founder of LlamaIndex)

Video Statistics and Information

Captions
Hey, it's Mayo from Chat with Data. Let's imagine you have two documents, in this case two PDFs. Here I'm looking at Lyft's SEC 10-K: a ton of text, structured data in the form of tables, and it's very long, hundreds of pages. I've also got Uber's as well, so you've got financial tables and text, and it's very tedious to get through. So let's imagine you have these two documents and you want to compare them, or you want to ask complex questions that you want the AI model to respond to. For example, you might want to compare the different segments between the two companies, i.e. the two PDFs, which requires the ability to scan through both documents and extract insights, or you might want to compare certain parts of the financial statements. Now, if you've watched my previous videos about how to chat with 1,000-page PDFs, where we looked at Tesla, that's a good case where you typically focus on one particular document. If you try that approach in this case, you might run into issues. So let's jump in here. The naive approach is prone to hallucination: you're probably going to run into situations where the model generates output that you don't expect, and it's not accurate. The reason for that, intuitively, is this. Here is a simplified version of the architecture from my previous videos. The user asks a question: okay, compare the key risks between Uber and Lyft. So you want to compare the risk section in the 10-K for Lyft with the one for Uber; you want the model to extract the risk section from each document and compare them. Alarm bells might already be ringing in your head about why this could fail with the naive approach covered in previous videos, which essentially is: we've taken the documents, chopped them up into chunks, embedded them, and stored them in a vector store. Then we take the query, turn the query into an embedding, and perform similarity search against the vector DB where the embeddings of the original documents are stored. And this is where the hallucination starts, because how is the retrieval mechanism supposed to know what exactly you're trying to pull out? It's going to be confused by the question being asked, because the question does not necessarily discriminate between documents when selecting the relevant retrieved docs that then get passed as context to the model to generate an accurate result. If you look at the retrieved docs in this case, say your top k is four: three of the four chunks could literally be chunks from Uber and only one chunk from Lyft. Remember, there's no way the vector DB where you store your embeddings would know which chunks represent Uber or Lyft without some advanced metadata filtering, but that's a different story for a different day. You can see this is problematic, because by the time you get these retrieved chunks from your documents the context is polluted and the output is not going to be correct. So for complex queries where you compare documents we need a more advanced approach, and this is where a tool like LlamaIndex can come into play. Here we want to be able to ask a question like: compare the revenue growth of Uber and Lyft from 2020 to 2021.
You can see the breakdown from the model as it runs, and eventually it comes out with results saying the revenue growth of Uber from 2020 to 2021 was 57% and the revenue growth of Lyft was 36%. Let's jump in here: okay, revenue was up 57%, this is in the highlights for 2021, so you can see this accurately came from the Uber doc. It does the same calculation over the entire Lyft document, and here it concludes that Uber had higher revenue growth than Lyft from 2020 to 2021. So how is this done, how does it work under the hood, and how can you build it for yourself? I think it's probably best to hand over at this point to the co-founder of LlamaIndex, Jerry, to explain. My name is Jerry; as Mayo said, it's great to be here. I'm co-founder of LlamaIndex. LlamaIndex is a framework to help you build LLM apps over your data, and we're super excited to walk through some of the core concepts today, as well as teaching you how to not just build prototype RAG-type applications but also handle more advanced questions over more complex documents. Awesome, all right, let's jump into the slides. Sharing the slides right here. I figured this would be a pretty quick overview and we'll bounce between both the slides and the notebook itself. The goal of this talk is to help you understand how to build RAG not just for simple use cases, where in about five lines of code you set up the simple stack, but to actually iterate on the algorithm so you can handle more advanced queries in two advanced use cases: one is multi-document comparisons and the other is embedded tables in PDFs. In the first section we'll walk through how the basic RAG stack works and give a general sense of the types of failures you might encounter when you build this. For those of you who are already familiar with it, this will be pretty easy to follow. Basically, what's going on in the current retrieval-augmented generation stack (and of course Mayo has made a lot of great videos on this) is that you load in a document, an unstructured document from some data source: it could be a PDF, an HTML file, a markdown file, or an API call. You load in some set of documents and then you use some sort of text-splitting algorithm to split it into a bunch of chunks. LlamaIndex offers a variety of toolkits around this: we have toolkits to load in data and parse it into chunks, and also tools to take those chunks, embed them, and put them into a vector database. Once the data is in a vector database, you can set up some sort of retrieval-augmented querying over it: you first retrieve from the vector database by fetching the top-k most similar chunks, and then you plug those into your LLM synthesis module to get back a response. If you have ever built some sort of question-answering app or chatbot, this whole idea of chatting with your data, this stack should be pretty familiar to you. One of the nice things about LlamaIndex as a framework is that it's tailored to help you set up stacks like this, and you can do the simple stack in about five lines of code.
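To make that "about five lines of code" claim concrete, here is a minimal sketch of the baseline stack. It assumes a LlamaIndex release from around the time of this video (the 0.8.x era), where these classes import from the top-level llama_index package (newer releases move them under llama_index.core), an OPENAI_API_KEY in the environment, and a local data/ folder containing the PDFs; treat it as an illustration rather than the exact notebook code.

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Load and parse the unstructured files (PDF, HTML, markdown, ...) into Document objects
documents = SimpleDirectoryReader("data").load_data()

# Chunk, embed, and store the chunks in an in-memory vector index
index = VectorStoreIndex.from_documents(documents)

# Retrieval-augmented querying: fetch the top-k similar chunks, then synthesize with the LLM
query_engine = index.as_query_engine()
print(query_engine.query("What are some of the risk factors for Uber?"))
```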
These sections walk through this in a little more detail, and it really divides into two categories: one is data ingestion and parsing, and the second is retrieval and querying. Data ingestion and parsing is a type of ETL for your unstructured data for use with LLMs, and in this stage, as I mentioned, you want to do some sort of text splitting, generate embeddings, and store them in a vector database. Once they're in a vector database, you do some sort of lookup to retrieve the most similar chunks given the query, and then take each chunk and stuff it into the LLM. There may be some complexities if the set of chunks overflows the context window of the LLM, and we have abstractions in LlamaIndex that help you deal with that. So now that we've walked through the basic RAG stack, Mayo, we can think about dealing with hallucination and failures, which again breaks down into retrieval and synthesis, and consider some more challenging use cases. In the next few sections we'll go through examples where the current RAG stack simply cannot handle certain types of questions, and so it will either return an incomplete answer or an answer that is incorrect. These use cases involve settings that are a little more advanced. One of them is multi-document comparisons: how do we ask more complex questions over multiple documents? What the existing stack allows you to do is ask questions about specific facts within a single document, or facts located in roughly a single place. If you want to synthesize two disparate bits of information from two different documents, that becomes more challenging to model with the current stack. Another use case that a lot of our users have talked about and relate to is the idea of complex document objects. A single PDF can have a lot of text but also a lot of tables within it; it can also have images, charts, graphs, and so on. So how do we properly model this data on the data-structure side, and how do we define the right retrieval algorithm to properly combine, or interleave, complex structured and unstructured data within a single document? We'll talk about both of these use cases. In the first case we'll talk about multi-document comparisons, and a very classic example is financial analysis. Let's say we want to look at the SEC 10-K filings for both Uber and Lyft in 2021, and let's say the question the user wants to ask is: compare and contrast the customer segments and geographies that grew the fastest, or compare and contrast the revenue growth of Uber and Lyft. Any time you ask a compare-and-contrast query, something that requires comparison across different documents, you're going to want to look at similar sections in both documents; say you have both PDFs and each has a specific section on revenue or customer growth. The issue is that when we do top-k retrieval over all the 10-K chunks (and we'll show this in the notebook in just a bit), it doesn't always work. What's going to happen, as shown in this diagram, is that you ask this question while all your 10-K document chunks are stored in a single collection in the vector database.
Let's say we look up the top four most similar chunks given this question. There is no guarantee you're going to get back the relevant sections from both Uber and Lyft at the same frequency. In fact, one thing we found is that sometimes all the chunks will be from Uber, or all the chunks will be from Lyft, or you'll have some uneven balance, and maybe one of the Lyft chunks doesn't even deal with the context at hand: it might not actually give you the revenue growth or usage information that allows you to answer the question. The reason for this is that when you do embedding lookup you're basically just hoping that the embedding similarity surfaces the relevant chunks; it's a bit less structured. So one idea we're going to propose, and we'll walk through the notebook right now, is: instead of throwing all the chunks into a single collection, what if we index the documents separately? We tag Uber as a separate collection and Lyft as its own collection, and then, given a more complex question like "compare revenue growth of Uber and Lyft", what if we break it down into two questions: describe the revenue growth of Uber in 2021, and describe the revenue growth of Lyft in 2021. We take each sub-question and ask it over a subset of the documents in our overall collection, so one question goes to the Uber 10-K collection and the other goes to the Lyft 10-K collection. We do retrieval within each document for its question, and after retrieval within each document we combine the results at the end to answer the final question. It's a bit more of a structured query-planning process, and we do show that this gives you better results than just doing top-k retrieval. Great, I think the next step would be to walk through a notebook showing how you can actually do this with LlamaIndex. And while you're doing that, Jerry, just a quick question for someone who might be wondering: how do you define what an index is; what does that mean? That's a good question, and it's a great segue into an overview of the overall categories of what LlamaIndex has to offer. If you look at the overall RAG pipeline, which I showed in the general diagram, it really has a few main components: you load in your data from some data source, you transform and parse it and load it into a data structure, and once the data is in some sort of data structure or storage system you query over it. An index falls into that middle category: it's basically a view of your data, represented in different ways. One example of an index, the most popular one, is a vector index, and I've been saying vector databases all this time; that's just indexing your data via embeddings, so that once you have embeddings associated with all the data you can look it up via top-k similarity search. But you can also index your data in other ways too. For instance, if you index the relationships between data you can put them in a graph database; if you index them with keywords or any sort of structured metadata, you could put them in a structured database as well.
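As an illustration of the "an index is just a view of your data" point, here is a hedged sketch showing the same documents indexed two different ways. The class names come from LlamaIndex circa 0.8.x, and the documents variable is assumed to hold the parsed pages from the earlier sketch; the keyword index here is simply one example of a non-embedding view, not the exact setup used in the notebook.

```python
from llama_index import VectorStoreIndex, KeywordTableIndex

# Same underlying documents, two different "views" of them:
# an embedding-based view for top-k similarity lookup...
vector_index = VectorStoreIndex.from_documents(documents)

# ...and a keyword-based view that maps extracted keywords to the chunks containing them
keyword_index = KeywordTableIndex.from_documents(documents)

# Each view exposes the same query interface but retrieves differently under the hood
print(vector_index.as_query_engine().query("Describe Lyft's revenue growth in 2021."))
```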
Right, so it's kind of like a family tree, effectively: you've got your grandparents and so on and so forth, and you have siblings and relationships between what you call nodes, right? Exactly, yeah. It's basically just a view of your data, and the overall idea is that you have some way of representing your data, that representation is stored in some storage system, and because you've represented your data in a certain way you can do more advanced queries over it. So in this example, we show that if we represent each document a little bit separately, treating Uber and Lyft as separate documents, we can do more interesting query analysis over them by breaking questions down per document. Right, so for someone who's not familiar with this, they might think that when you compare two different documents the AI is scanning one document and then scanning the other, like a human being would. What you're explaining here is that it's not that straightforward; it's almost treating them as separate data sources initially. Yeah, one of the arguments we try to make is that if you want your AI to interact with your data in the right ways, you have to think carefully about how you define the data that's represented to the AI. In the existing setup, where you take all the 10-K documents, chunk and parse them, and throw them into one collection, it's just harder for the AI to go in and reason: does this chunk correspond to Uber, does this one belong to Lyft, and how do I compare the two together? Having the right structures over your documents helps the AI reason better about how to analyze them. One last thing, not jumping too far ahead, but for people who want to do this example and, instead of comparing customer segments, want to compare something in the financial statements: now you're in the territory of comparison where you also want to tap into tabular content within each PDF. Yeah, exactly; tabular content is in the next section right after this. For now, for simplicity, we can assume both documents are just unstructured text without too many embedded tables; of course 10-K filings have a bunch of tables in there, but we can set that aside. The step after this is: within a document you have a bunch of these structured tables, it's important to parse them, so how do you model that properly? Yeah, just highlighting that it's possible to do that. So yeah, we can jump to the Colab. For people who are not technical, in simple terms, what is a Google Colab and what are all these crazy things they're looking at right now? Yeah, a Colab is just a Python notebook. The nice thing is that it's hosted on the web and you can share it with anybody, so anybody can go ahead and run it. Even if you're not technical, I think the only thing you need to do is fill in your API key in the section above, which I'm not going to show because I have my own API key in there; you just fill that out and then run all the cells,
and you basically don't have to think about it; you can just run through all the cells. It's a nice way of packaging a script or a demo into something that's shareable. Yeah, this is just the code version of what's going on under the hood. Exactly, and if you're not too sure, you can always copy and paste into ChatGPT and it'll explain what's going on. Exactly. Cool, great, so let's walk through some of the basic demos. I'll skip the description of some of the imports and just go through each section, talking at a high level about what's going on. A lot of this is also in the docs, and I'll try to add some annotations to this notebook so it can be shared along with the slides; we can link it in the description. At the very basics, we have our own LLM abstractions. The first thing we want to do is initialize an OpenAI LLM; we're going to use GPT-3.5 Turbo. Then we define this overall bundle called a service context, which is just an abstraction that acts as a container or config for your LLM, embedding model, chunk size, and other things. Now that we have that, the next script downloads the Uber and Lyft 10-Ks from Dropbox; we have those pre-cached somewhere. As Mayo showed, you can look at what the Lyft and Uber 10-Ks look like, and they're basically hundreds of pages long. If you're familiar with financial analysis you know this format: you go through the business overview, the risk factors, and then a bunch of tables covering revenue growth, costs, and so on. They're very complex documents, and of course if you're a financial analyst you're used to analyzing this yourself; you can also see there are a bunch of tables embedded within these documents. What we're doing here is using a very simple LlamaIndex abstraction that's just a convenience wrapper to load in basically any file type, and here we're loading in the Lyft and Uber docs. Oh sorry, go on. Yeah, when you say loading, just to clarify, what you're doing here is essentially embedding the documents? Not quite; that's the next step right after this. All we're doing here is running the files through a PDF parser, and it's just finished, so this is basically what it looks like: we extract a set of pages, where each document corresponds to one page of the PDF, and if we print the content of one you can see it's just a dump of plain text. There are a bunch of different PDF parsers you can try out; we have about ten or so on our LlamaHub website, which is our collection of loaders. This one uses pypdf; there's also PyMuPDF, we integrate with Unstructured.io (they're great), we have deep doc detection, and a few others as well.
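Here is a hedged sketch of the setup and loading steps just described. The file paths are placeholders for wherever you saved the downloaded 10-Ks, and the imports assume the 0.8.x-era LlamaIndex layout with pypdf installed as the default PDF parser; the exact notebook may differ.

```python
from llama_index import ServiceContext, SimpleDirectoryReader
from llama_index.llms import OpenAI

# Service context: a container/config for the LLM, embedding model, chunk size, etc.
service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0)
)

# Run each PDF through a parser (pypdf by default); you get one Document per page
lyft_docs = SimpleDirectoryReader(input_files=["./data/lyft_2021.pdf"]).load_data()
uber_docs = SimpleDirectoryReader(input_files=["./data/uber_2021.pdf"]).load_data()

print(len(lyft_docs), "Lyft pages;", len(uber_docs), "Uber pages")
print(lyft_docs[0].text[:500])  # plain text extracted from the first page
```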
Cool, great. So now that we've loaded in the data, the next step is to index it and store it, and Mayo, this is exactly what you talked about: as a convenience, we can do this in one line of code. You just call VectorStoreIndex.from_documents with the set of documents you feed in, and the idea is to set up that baseline RAG stack I just described. What this does under the hood, as it runs, is chunk everything up, embed each chunk, and put it into a simple in-memory vector store. For those who haven't watched my videos: embeddings essentially mean you're transforming your text into numbers, and these numbers are something the computer can understand and run computations on, including finding the parts of your document related to the question you're asking. Exactly. Once we've indexed this, we get this thing called a query engine from the index. What the query engine gives you is essentially an interface to query the data that's now stored within your vector index; it's an interface for you to start asking questions. When you ask questions (and we'll go through some of them here) we're going through that second step of the RAG architecture we talked about: we do retrieval from your vector DB to fetch a bunch of chunks, and then we take those chunks and feed them into the LLM. For the sake of this tutorial I'm abstracting all of this into a single line of code, so there's a bit of magic going on under there; well, it's not really magic, it's just really these two steps. A quick plug: if you want to really understand how these things work under the hood, we did come out with a lower-level set of tutorials on retrieval and synthesis to help you build RAG from scratch, not using these high-level abstractions but the lower-level ones, so you can learn for yourself how it works. Now that we have this base engine, we can start asking questions over it, and the idea is to show the capabilities of the baseline. We have both Lyft and Uber doc chunks in there, and we can ask questions like: what are some of the risk factors for Uber? We set the top k equal to four, which means we retrieve the top four chunks for any given query. You can see we get back a response like: some risk factors include violent, inappropriate, or dangerous activity, those kinds of things. And then, really quick, you can go in and take a look at the sources. Yeah, it's very important that you point that out, because people will want to know where exactly the answer is coming from in the document. A hundred percent. So you can see here the source nodes: there are four of them, and this is an example of the first source the answer comes from, just an example chunk within the document. And just to prove it, maybe you want to jump back to the document and show that this came from it. Let's see, I'll Ctrl-F "drivers, consumers, merchants" ... I'll copy a bit more of it ... it's right here: "we are not able to control or predict the actions of platform users and third parties either during the use of the platform", and this is, if I'm not mistaken, in the risk factors section.
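A hedged sketch of the indexing, querying, and source-inspection steps above, reusing the lyft_docs, uber_docs, and service_context assumed in the previous snippet and the same 0.8.x-era API.

```python
from llama_index import VectorStoreIndex

# One collection containing chunks from both 10-Ks (the "naive" baseline)
index = VectorStoreIndex.from_documents(
    lyft_docs + uber_docs, service_context=service_context
)

# Retrieve the 4 most similar chunks per query, then synthesize an answer with the LLM
query_engine = index.as_query_engine(similarity_top_k=4)
response = query_engine.query("What are some of the risk factors for Uber?")
print(response)

# Inspect where the answer came from: each source node carries the chunk text and metadata
for source in response.source_nodes:
    print(source.node.metadata.get("page_label"), source.node.get_content()[:200])
```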
Yeah, amazing, good. Okay, so now we have this vector index; let's ask a compare-and-contrast question: compare and contrast the risk factors of Uber and Lyft, and run it. I might have already given away what the results will look like, but it basically says: I'm sorry, but I cannot provide a direct comparison and contrast of the risk factors. Okay, so why is that? When you go into the sources, you see that the first one is Lyft 2021, with a page label. When you go to the second source, it's still Lyft, the third is still Lyft, and the fourth is still Lyft. So you've basically dumped everything into a single collection and fetched a bunch of random chunks, and even with the metadata saying a chunk is from Lyft, the model just isn't able to figure out the correct answer. This motivates a more structured approach, and it's a nice segue into the sub-question query engine. This query engine will do the following, and we'll show you how to set it up. First, we're going to treat the different documents separately; we're actually going to index them differently. You can do this in a variety of ways: technically they could live in the same collection in a vector database like Pinecone or Chroma but under different namespaces; regardless, we're going to treat them as separate tools, so we'll have a Lyft index and an Uber index. Yeah, and as you're loading all of this, the more technical viewer might be wondering: all these different vector stores have different terminologies, you've got namespaces with Pinecone and collections with Chroma, so are all the embeddings going into the same namespace or the same collection, or do you have to structure your code to embed them separately into different collections or namespaces? That's a really good question, and it's one of the nice things about our abstractions: the index is just a view over the storage system, and we integrate with all these storage systems, so the short answer is it can be whatever you want it to be. You could have each of these be a separate collection, or, if you're using Pinecone, each could sit under a different namespace, or they could just be different metadata filters in the same table. We don't show how to configure that here, but you can do it. Great, so now we have the Uber and Lyft indexes, and the next step, similar to before, is to get a query engine for each. Now we have separate query engines for Lyft and Uber, with the similarity top k set to two, so if we ask questions on this one it's going to be about Lyft, and if we ask questions on that one it's going to be about Uber.
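A hedged sketch of giving each company its own index and query engine, as described; it reuses the same assumed variables and LlamaIndex version as the earlier snippets.

```python
from llama_index import VectorStoreIndex

# Index each 10-K separately so retrieval can be scoped to one company at a time
lyft_index = VectorStoreIndex.from_documents(lyft_docs, service_context=service_context)
uber_index = VectorStoreIndex.from_documents(uber_docs, service_context=service_context)

# One query engine per document; each retrieves only from its own index
lyft_engine = lyft_index.as_query_engine(similarity_top_k=2)
uber_engine = uber_index.as_query_engine(similarity_top_k=2)

print(lyft_engine.query("Describe Lyft's revenue growth in 2021."))
print(uber_engine.query("Describe Uber's revenue growth in 2021."))
```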
The next step starts to get into, as an overall concept, a little bit of agentic reasoning: using LLMs not just for the final synthesis step in RAG but to help with a bit of automated decision making. Here we're going to define each of Lyft and Uber as a tool, and a tool has a name and a description. The Lyft tool says "provides information about Lyft financials for the year 2021", and the Uber tool says "provides information about Uber financials for the year 2021". We define these tools and give them to this higher-level query engine, which is basically like a mini agent; it's not really a full ReAct agent, if you're familiar with that, or any sort of full agent loop, but the idea is that we rely on an automated decision-making process by passing this tool metadata up to an overall query engine that can make decisions. By providing these as tools with names and descriptions, we can implement the approach shown here: given a top-level question, it can figure out how to break it into sub-questions that correspond to specific subsets of these tools. And here we see that we initialize our sub-question query engine. Now that it's initialized, let's run some example queries. One example: compare and contrast the risk factors of Uber and Lyft, basically the same question as before. You can see it's doing exactly what I said: it takes the overall question and breaks it down into two sub-questions, what are the risk factors for Uber and what are the risk factors for Lyft, and asks each one over the corresponding document. Now you have these two separate answers, for Lyft and Uber, and you get back a final answer that is coherent: it says the risk factors for both Uber and Lyft include potential criminal, violent, inappropriate, or dangerous activity by platform users, and so on. So, unlike the previous example, where it couldn't answer, here we are getting an answer. Here's another example: tell me which was higher, Uber's revenue growth or Lyft's revenue growth, and using the text, explain the reasons for the revenue growth. Here we run it across both the base query engine and the sub-question query engine. I think it just finished running, and you can see the sub-questions generated: what was Uber's revenue growth, what was Lyft's revenue growth, what factors contributed to Uber's revenue growth, and what factors contributed to Lyft's. You get back answers to all of these, because you're breaking it down into a query plan, and then you're able to get back a final answer. So now let's compare the responses. In this setting, what's interesting is that the answer actually changes a little bit. This is the example from the base query engine, which actually does give you a response: it says Uber's revenue growth was higher than Lyft's revenue growth, the text explains that Uber's revenue growth is shaped by certain factors and Lyft's revenue growth depends on others. And the sub-question query engine, of course, gives you back a similar response as well.
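Here is a hedged sketch of wiring the per-document engines up as tools and handing them to the sub-question query engine. The tool names are illustrative rather than the notebook's exact ones, and the imports again assume the 0.8.x-era package layout.

```python
from llama_index.query_engine import SubQuestionQueryEngine
from llama_index.tools import QueryEngineTool, ToolMetadata

# Wrap each per-document query engine as a named, described tool
query_engine_tools = [
    QueryEngineTool(
        query_engine=lyft_engine,
        metadata=ToolMetadata(
            name="lyft_10k",
            description="Provides information about Lyft financials for the year 2021",
        ),
    ),
    QueryEngineTool(
        query_engine=uber_engine,
        metadata=ToolMetadata(
            name="uber_10k",
            description="Provides information about Uber financials for the year 2021",
        ),
    ),
]

# The sub-question engine breaks a complex query into sub-questions,
# routes each one to the matching tool, and synthesizes a final answer
sub_question_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools, service_context=service_context
)

response = sub_question_engine.query(
    "Compare and contrast the risk factors of Uber and Lyft."
)
print(response)
```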
I think the catch is that there's a bit of stochasticity in here too, because when you look at the base response's sources, in a previous run it actually said it didn't have enough context to explain the reasons for Lyft's revenue growth. So we can inspect some of the sources of the base response to see why. Here we see this one is Uber, this one is Lyft, and then this is Uber and this is also Uber, so you got three Uber results and one Lyft. We've found that when you do this, the language model gets confused about half the time, so in this case it happened to be right, but because the sources are imbalanced the answer sometimes isn't fully correct. What happened, obviously different from the previous case, is that the sub-question engine was able to handle each document separately and then do the comparison. Now, the more technical person here might ask how this is different from a function-calling kind of strategy. For those who don't know what I mean: it's a capability where OpenAI has fine-tuned their models to let the model extract specific structured information from the query and use it to do other things, and your query engine structure, with the tool descriptions, looked similar to function calling. At a high level, and this is a great question, this starts to get into agentic behavior, and it's worth a longer discussion about the whole spectrum between zero automated reasoning and full automated reasoning. There are a few concrete differences from OpenAI function calling. One is that here all the query plans are generated in parallel, so it's really designed for this case of multi-document comparisons, being able to look at everything independently, whereas a function-calling agent by default relies on some sort of sequential loop, so it's going to be a bit slower. Of course, a function-calling agent can take in this complex question, do some chain-of-thought prompting, and then break it down. In general we've noticed that the more flexible the agent loop is, as you go to the right of the spectrum towards function-calling ReAct loops, technically the more flexible it is, but also the more prone it is to failure. One thing we found with weaker models like GPT-3.5 Turbo on function calling is that when we ask complex questions like this, a lot of the time it just keeps iterating on these calls even when it shouldn't, so it gets into loops, and for some reason it's a bit less reliable. That's one piece. The other piece is that, regardless of whether you're using query planning or chain-of-thought reasoning, one of the nice things we do is map each sub-question to the specific subset of data it corresponds to. So here, "describe revenue growth at Uber in 2021" maps to Uber's 10-K and is asked specifically over that document. The idea is that if you want to build this yourself, you definitely can.
Part of the goal I'm trying to teach is that you should not just do chain-of-thought over your data or break the question down into sub-questions; you should also select the relevant subset of data that each question corresponds to, and that helps increase reliability. Sometimes, when you break it down into sub-questions and ask them over the whole dataset, you again get back a hodgepodge of information from different sources, and you might not be able to synthesize the right answer, or the model might hallucinate given that set of sources. So this is just a structured approach: given a question, break it down and also restrict it to the subsets of data it corresponds to, then combine the answers. Cool, and what would you say are the potential downsides of this strategy? Yeah, some basic downsides, depending on the angle you look at it from. One is that if you care a lot about latency and cost, this does increase your cost and your latency a little, because we're breaking the query down into sub-questions. We do async-ify all of these, so the sub-questions are asked in parallel, but there are roughly two extra LLM calls: we first break the question down, then answer each sub-question, and we synthesize everything at the end. The other piece is that this is designed for multi-document comparison; you're not going to get AGI from this engine. It's not going to solve arbitrary tasks like "find me how to achieve world peace"; it's very much oriented towards comparisons for financial analysis or similar settings. Right, but this can work for other types of documents, other PDFs people want to compare, and I guess it can work with two, three, or more documents as well? Yes, this demo uses two documents; we have an example with three, and you can do an arbitrary number. Yeah, cool. Awesome, so I think that was a pretty good recap of how to effectively compare and contrast documents, which is a complex task, especially if you want to do it for large documents like annual reports, policy documents, even legal documents. Jerry, anything else you want to say to wrap this up? No, the one thing I'll say is that these are obviously just basic qualitative benchmarks, and for the sake of articulating what the development process should be, there are a few things to keep in mind. One is being able to define an evaluation benchmark: actually say, okay, here's my data and here's the candidate set of questions I want to ask over it, and if some of them are comparison queries, let's benchmark that. Before you even use the sub-question query engine, use the basic stack, define a set of questions, and define some metrics to measure against, and it's only when that metric doesn't meet your quality bar that you should iterate on these more advanced techniques like the sub-question query engine. That's one thing we didn't get to, because we mostly looked at qualitative examples, but it is quite important, so I do want to point that out.
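As a hedged sketch of that benchmarking idea: run a small candidate question set through both engines and score the answers. The question list and the grade function here are hypothetical placeholders; in practice you would substitute your own metric or an LLM-based evaluator.

```python
# Hypothetical candidate questions, including a comparison query
benchmark_questions = [
    "What were Uber's main risk factors in 2021?",
    "Compare and contrast the revenue growth of Uber and Lyft in 2021.",
]

def grade(question: str, answer: str) -> float:
    """Placeholder metric; swap in a real scoring function or LLM-based evaluation."""
    return float(len(answer) > 0)

# Run the same questions over the baseline engine and the sub-question engine
for engine_name, engine in [("base", query_engine), ("sub_question", sub_question_engine)]:
    scores = [grade(q, str(engine.query(q))) for q in benchmark_questions]
    print(engine_name, "avg score:", sum(scores) / len(scores))
```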
Awesome, all right, so we're going to have the links to this in the description, alongside information about LlamaIndex and more. Thanks, Jerry. Thank you.
Info
Channel: Chat with data
Views: 11,371
Keywords: gpt3, langchain, openai, machine learning, artificial intelligence, natural language processing, nlp, typescript, semantic search, similarity search, gpt-3, gpt4, openai gpt3, openai gpt3 tutorial, openai embeddings, openai api, text-embedding-ada-002, new gpt3, openai sematic search, gpt 3 semantic search, chatbot, langchainchatgpt, langchainchatbot, openai question answering
Id: UmvqMscxwoc
Length: 42min 41sec (2561 seconds)
Published: Thu Oct 12 2023