Chunk large complex PDFs to summarize using LLM

Captions
In this recording I am going to talk about a technique to parse large PDFs while maintaining the context of the PDF, and then I'll show how to summarize large PDFs using a technique called map-reduce. I'll also go through the code with which I have implemented this technique.

So what was the motivation for me to create this? I actually had two motivations. The first was to summarize arXiv papers on large language models so that I can quickly grasp the concepts in a paper. There are so many arXiv papers on language models that going through each of them end to end is, I won't say difficult, but time consuming, and with the pace at which language models and generative AI are evolving, it is important to save some time while going through these knowledge articles and grasping the knowledge. That was one motivation. There is one more motivation which I will not be able to talk about here, because it is confidential.

Now the challenges. There are primarily three challenges in summarizing large PDFs. First, PDFs are unstructured. I am not going into the details of why we need chunking, because I am assuming the basic RAG pattern is understood by everybody now; it has become mainstream. Coming back to the chunking part: when I chunk these PDF documents, how do I ensure that I do not lose the context of a chunk while creating it? For example, if there is a section and I chunk it arbitrarily, I might split at any position within that section, so the next part of the section carries no context of what was said earlier. That is the first problem. Second, there are complex tables within PDFs, and not all of them are simple row-column layouts. I tried Tabula and it did not work perfectly. I could not make PyPDF or PyMuPDF work perfectly either: sometimes the tables were parsed correctly, and sometimes the relationships between the columns were completely disarranged, which made it very difficult for the language models to understand the tables. That is when I thought it would probably require a more traditional machine-learning-based layout parsing technique to extract the tables, and that is what I am going to talk about today. Third, once you extract the tables, what is the best way to make the LLM understand the content of a table? So that is the problem statement.

I want to walk through the solution first, so that when we go through the code we can come back and understand it better. I start with a PDF, an arXiv paper. Let me actually go to that document. This is the arXiv paper I have taken, a very interesting one that you can go through at your leisure: "Lost in the Middle: How Language Models Use Long Contexts". It has an introduction, section 2 on language models, section 3 on the experimental setup, and so on, and at the end we have the tables; there are other tables as well. I ignored the figures for now.
At this point I have not found a good technique for getting models, even GPT-4, which is multimodal, to do good question answering based on a figure; in fact, I think it is practically not possible right now. I tried some image captioning techniques, but the captioning models I tried from Hugging Face were not that great; they were not able to explain the images well. I think a custom image captioning model will probably need to be created in this space, but as of now I have not seen GPT-4, despite being multimodal, capable of answering questions based on what is in a figure. So I ignored figures for the time being; I am actively looking for a solution and will make another recording when I find one.

So this is the document I have taken. After I get the document, I use the Adobe PDF Extract API to do the layout parsing. The Extract API takes the PDF and gives me two outputs: the entire structure of the PDF in a JSON document, plus a number of Excel files into which the tables are extracted. I'll soon show you how that looks, but first, what is the Adobe Extract API? The PDF Extract API is included with the PDF Services API. I am using the free tier, which has some limitations; I think it is 500 documents per day, which is enough for this demonstration. It is a cloud-based web service that uses Adobe Sensei AI technology to automatically extract content and structural information from PDF documents, native or scanned, and to output it in a structured JSON format. The service extracts text, complex tables, and figures. I'll put the link in the recording so that you can go through the full overview of the service.

Let's see how the output looks. I extracted everything under this PDF documents folder, and this is the JSON I get: the JSON that encapsulates the entire structure of the document. Let's take an example, starting from the top so that it is easier to understand. If I go to my document and search for the abstract text, "While recent language models...", that text is there in the JSON along with its path. The path says it is a paragraph, and the heading above it is "Abstract". So if I want to chunk based on H1, the top-level heading, I can do that; or, as I actually did, I can chunk by the second level of heading by looking at that label in the path; in fact, I created a custom parser for that. Chunking by that label ensures that I take an entire section at a time, and based on this document it seems that every section carries the full context for that section.
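To make that JSON structure concrete, here is a minimal sketch of walking the elements list; this is an illustration rather than code from the recording, and anything beyond the "elements"/"Path"/"Text" keys described above (the folder layout and the structuredData.json file name the Extract API typically produces) should be treated as an assumption:

```python
import json

# structuredData.json is the layout JSON inside the Extract API's result zip.
with open("output/unzipped/structuredData.json") as f:
    data = json.load(f)

# Every element carries a "Path" locating it in the document tree, e.g.
# "//Document/H2[3]" for a second-level heading or "//Document/Sect/P[2]"
# for a paragraph, plus the extracted "Text" where applicable.
for element in data["elements"]:
    print(element.get("Path"), "->", element.get("Text", "")[:60])
```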
All the tables within the document are extracted as Excel files. For example, if I open the first one, it says "model, closed book, oracle". The headings may not always be correct, but here it extracted them correctly. Similarly, the other tables are also extracted as Excel files; if I take a look at this one, this table here, I think it did a very good job of extracting the table into an Excel file.

So that is what the Adobe Extract API gave me. I then wrote a custom JSON parser to parse the JSON and chunk the PDF into its respective sections, so that I have a separate introduction chunk, a language-models chunk, a multi-document-QA chunk, and so on. Let's take a look: after chunking, the abstract and the introduction are together in one chunk, then section 2 is a separate chunk, section 3 a separate chunk, and section 4 a separate chunk. That is what the custom JSON parser I wrote did: it chunked the PDF into contextual sections, context-aware chunks. That was my extraction part.

After extraction, the actual summarization starts. For the summarization I used LangChain's implementation of map-reduce summarization, which is this option here; I'll go through it in detail when we get to the code, explaining what each of the three chains used here is doing. Before that, let's understand the flow. I use a text loader to convert all of these sections into LangChain's document structure. Then I create a map prompt, which I first tried out in LangChain Hub; I am going to show that to you as well. The map prompt takes each of the sections individually and creates a summary of each section, which is then given to the reduce prompt. The reduce prompt takes all of those summaries and creates one final summarized version of the PDF. That is the whole flow. I hope you are still with me; this will probably be a slightly longer recording, because there are multiple concepts I am going to show you, but I hope it will be interesting. What I am going to show in the code is the extraction part, the summarization part, and also how you can use LangChain Hub to test your prompt before actually using it in your program.

Let us go into the code now, and we'll go through it slowly so that we understand each and every step. Let me start with the main section. This is where my input PDF is. When I run the Extract API, it gives me the JSON and the tables that I mentioned, packaged in a zip file; this is the output path of the Extract API. In my custom parser I first unzip that into this location, and then all the chunks that I create go into the chunks directory. In step one I call parse_pdf with the input file path, the PDF I showed earlier. If we go into parse_pdf, this is what calls the Extract API. I created an Adobe account to get the client ID and the client secret; it requires those two keys, and you can get them from the free tier. I take those and create the credentials, then in the extract PDF options I specify that I want to extract both text and tables from the PDF, and I save the result, a zip file, into the output path. Then I unzip it into my unzipped folder.
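For reference, here is a rough sketch of that Extract API call, modeled on Adobe's pdfservices-sdk Python samples rather than the exact code in the recording; import paths and builder names vary between SDK versions, so treat them as assumptions:

```python
import zipfile

from adobe.pdfservices.operation.auth.credentials import Credentials
from adobe.pdfservices.operation.execution_context import ExecutionContext
from adobe.pdfservices.operation.io.file_ref import FileRef
from adobe.pdfservices.operation.pdfops.extract_pdf_operation import ExtractPDFOperation
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_pdf_options import ExtractPDFOptions
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_element_type import ExtractElementType

# Build credentials from the client id / client secret of the free-tier account.
credentials = (Credentials.service_principal_credentials_builder()
               .with_client_id("CLIENT_ID")
               .with_client_secret("CLIENT_SECRET")
               .build())
execution_context = ExecutionContext.create(credentials)

# Ask the service to extract both text and tables.
extract_operation = ExtractPDFOperation.create_new()
extract_operation.set_input(FileRef.create_from_local_file("input/paper.pdf"))
extract_operation.set_options(ExtractPDFOptions.builder()
                              .with_elements_to_extract([ExtractElementType.TEXT,
                                                         ExtractElementType.TABLES])
                              .build())

# Execute, save the zip (structuredData.json + tables/*.xlsx), then unzip it.
result = extract_operation.execute(execution_context)
result.save_as("output/extract.zip")
with zipfile.ZipFile("output/extract.zip") as z:
    z.extractall("output/unzipped")
```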
So that is what I did: I unzipped the zip file, which gives me the JSON with the layout of the PDF and all the tables. After that, my chunking exercise starts. For the chunking I open and read the JSON document; the paths that I showed you are captured in the elements list, and for each element in that list I look at its path. I mentioned that I did the chunking at the second level of heading, so I check whether the element is a second-level heading. There is some logic here to handle the first heading: when I hit the first H2 I don't close anything and keep writing, and when the next H2 comes, I first close the previous file and then create another file for the new section. That logic ensures that I chunk at the H2 level.

If I find a table in the path, and this is where I had to use a regular expression, because when there are multiple tables the path can be Table[1], Table[2], Table[3], and so on, then on a match I read the corresponding XLSX file from the tables folder, where the Extract API has placed it, and convert it to markdown. That markdown conversion is very important. When I first converted the tables to CSV, created a chunk, and had GPT-3.5 and GPT-4 read the CSV, I found a lot of inconsistencies: sometimes the model was able to understand the CSV format and sometimes it was not. When I converted the tables to markdown instead, GPT-4 gave me very consistent answers. GPT-3.5 was still somewhat inconsistent with markdown, good maybe 95 percent of the time and failing the other five percent, but markdown with GPT-4 gave very consistent results. So this parsing code created the chunks I showed earlier: the PDF is now chunked into 10 context-aware chunks, each at the H2 second level of heading. That was the first step, parsing and chunking the file.
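A simplified sketch of that chunking logic might look as follows. It is a reconstruction of the idea, not the speaker's parser: the fileoutpartN.xlsx naming and the assumption that the Excel files appear in the same order as the Table elements are both hypothetical, and pandas' to_markdown additionally requires the tabulate package:

```python
import json
import os
import re

import pandas as pd

UNZIPPED = "output/unzipped"    # hypothetical paths; the video uses its own folders
CHUNK_DIR = "output/chunks"
os.makedirs(CHUNK_DIR, exist_ok=True)

with open(os.path.join(UNZIPPED, "structuredData.json")) as f:
    elements = json.load(f)["elements"]

chunks, current = [], []
table_index = 0  # assumed: tables/*.xlsx files match the order of Table elements

for element in elements:
    path = element.get("Path", "")
    if re.search(r"/H2(\[\d+\])?$", path) and current:
        # A new second-level heading closes the previous context-aware chunk.
        chunks.append("\n".join(current))
        current = []
    if re.search(r"/Table(\[\d+\])?$", path):
        # Table root element: embed the extracted XLSX as markdown, which
        # GPT-4 read far more consistently than CSV in the speaker's tests.
        xlsx = os.path.join(UNZIPPED, "tables", f"fileoutpart{table_index}.xlsx")
        current.append(pd.read_excel(xlsx).to_markdown(index=False))
        table_index += 1
    elif "/Table" in path:
        continue  # skip per-cell elements; the whole table was embedded above
    elif element.get("Text"):
        current.append(element["Text"])
if current:
    chunks.append("\n".join(current))

# One file per section, ready to be read back with LangChain's TextLoader.
for i, chunk in enumerate(chunks):
    with open(os.path.join(CHUNK_DIR, f"chunk_{i:02d}.txt"), "w") as f:
        f.write(chunk)
```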
Step two is where I use LangChain's TextLoader to read all of these chunks into LangChain's Document schema. When I run this, I get all the chunks as Document objects in a document list.

Step three is where the summarization process starts. This is where I create an instance of the OpenAI model and give it my map prompt, and as you can see here, I am pulling that prompt from LangChain Hub. Let me go to LangChain Hub and show you the beauty of that feature. In LangChain Hub, this is where my prompts are; the map prompt is here. Without writing a single piece of code, I can first check whether the prompt is working correctly. How did I do that? I wrote the prompt, took a fairly big chunk, gave it that document, and asked it to summarize the main themes, and this is what I got back from the map chain. So that is the map chain part: I pulled the prompt from the Hub and created the map chain using it.

Then I create the reduce prompt. What does the reduce prompt do? If we go back to the diagram: the map prompt has given me ten summaries, one for each of the ten sections, and the reduce prompt will now take those summaries and create one single summarized document. Let's look at both prompts. The map prompt was: "You are a helpful chatbot and an expert in extracting the main themes from a given document. You are provided a set of documents below. Based on this set of documents, please identify the main themes." So the map prompt, given those documents, identified the main themes and created a summary of each document separately. After that, the output goes to the reduce prompt, which is: "You are an expert in distilling content based on a set of summaries of main themes. Below is a list of doc summaries. Take these and distill them into a final, consolidated summary of the main themes." So whatever the map chain has produced is given as input to the reduce chain, and the reduce chain finally gives me the final summarized version.

Now let's see how I stitched it together in a reduce-documents chain. I have the map chain created here and the reduce chain created here, and then I create the ReduceDocumentsChain. The combine-documents chain is actually the reduce chain wrapped in a StuffDocumentsChain; a stuff-documents chain means that all of the document summaries are passed together into the reduce chain so that it can create one summary. So the reduce chain is passed as the LLM chain to the combine-documents chain, and when I create the ReduceDocumentsChain I give it that combine-documents chain. I also give it a collapse-documents chain. What is the collapse-documents chain? When the reduce chain looks at all the summaries to combine them, there may be a situation where the total token length of all those document summaries goes beyond the token limit of the model. Here I used GPT-3.5 16k, so the maximum is 16,000 tokens. Before the reduce chain takes over, the collapse-documents chain, which uses the same prompt, takes the document summaries and recursively summarizes them in batches until the result fits within 16,000 tokens; only then is it passed to the actual reduce chain, which summarizes based on that content. So the ReduceDocumentsChain has a combine-documents chain and a collapse-documents chain, and the collapse-documents chain ensures that the summaries that finally go to the reduce chain do not exceed the token limit of the model.

Finally, I create the MapReduceDocumentsChain, my final chain, which combines the map part and the reduce part: I tell it this is my map chain and this is my reduce chain, and the variable I pass to the map chain is the documents; that is where I pass all of those chunks. When I run the map-reduce chain, I get the summary of the document.
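Putting steps two and three together, here is a sketch of the whole assembly following LangChain's documented map-reduce summarization pattern from around the time of this recording; the prompt wording paraphrases the transcript, and the folder path, Hub handle, and model settings are assumptions:

```python
import glob

from langchain.chains import LLMChain, MapReduceDocumentsChain, ReduceDocumentsChain, StuffDocumentsChain
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import TextLoader
from langchain.prompts import PromptTemplate
# from langchain import hub   # alternatively: map_prompt = hub.pull("<handle>/map-prompt")

# Step 2: read every chunk file into LangChain's Document schema.
docs = []
for path in sorted(glob.glob("output/chunks/*.txt")):
    docs.extend(TextLoader(path).load())

llm = ChatOpenAI(model_name="gpt-3.5-turbo-16k", temperature=0)

# Map: summarize the main themes of each section chunk individually.
map_prompt = PromptTemplate.from_template(
    "You are a helpful chatbot and an expert in extracting the main themes "
    "from a given document. You are provided a set of documents below. "
    "Based on this set of documents, please identify the main themes.\n\n{docs}"
)
map_chain = LLMChain(llm=llm, prompt=map_prompt)

# Reduce: distill the per-section summaries into one consolidated summary.
reduce_prompt = PromptTemplate.from_template(
    "You are an expert in distilling content based on a set of summaries of "
    "main themes. Below is a list of doc summaries. Take these and distill "
    "them into a final, consolidated summary of the main themes.\n\n{docs}"
)
reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt)

# Stuff all summaries into a single reduce call; the same chain doubles as the
# collapse step, recursively shrinking the summaries if they exceed 16k tokens.
combine_documents_chain = StuffDocumentsChain(
    llm_chain=reduce_chain, document_variable_name="docs"
)
reduce_documents_chain = ReduceDocumentsChain(
    combine_documents_chain=combine_documents_chain,
    collapse_documents_chain=combine_documents_chain,
    token_max=16000,
)

map_reduce_chain = MapReduceDocumentsChain(
    llm_chain=map_chain,
    reduce_documents_chain=reduce_documents_chain,
    document_variable_name="docs",
)

print(map_reduce_chain.run(docs))
```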
One caution here: this is a relatively expensive chain. I ran it three times, and it cost me around $0.78 for those three runs; I am not going to run it again, because I want to save some money on my subscription. But when I ran it, the output looked like this: it read all of those chunks, and based on them the final summary says, "The main themes identified from the provided set of documents are...", followed by the numbered points summarizing the whole content of the PDF, and finally, "Overall, the main themes revolve around language model performance, factors affecting performance, evaluation protocols, challenges in multi-document question answering, and token count and performance analysis." If I read those five or six points, I get the gist of what the document is talking about.

That is all I wanted to share. I know this is a relatively long recording compared to what I have been doing so far, but there were a lot of things that I wanted to talk about, and that is why it ran a bit long. Thank you, bye-bye.
Info
Channel: Rajib Deb
Views: 4,482
Id: FZFB92UnXQ4
Length: 29min 58sec (1798 seconds)
Published: Sat Sep 30 2023