Extract Table Info From PDF & Summarise It Using Llama3 via Ollama | LangChain

Captions
Hello guys, welcome back to Data Science Basics. This is the second video in the Unstructured playlist. In the first one I showed you the basics of Unstructured, the best tool for getting your data ready for RAG applications, and as mentioned in that video, in this one let's go and extract information from a PDF, mainly from the tables.

How will the video proceed? This is the end-to-end flow: first we have the PDF, and we pass it into the partition_pdf function of unstructured; I will be showing you three different ways to do that. We extract the information from the table, convert the table into HTML, and get a summary using LangChain and Llama 3 via Ollama. Finally, if you want to convert that information into a pandas DataFrame for further exploration, I will also show you how to do that. So that is the flow, and the code is already in the GitHub repository if you want to follow along with me: just clone the "YouTube stuffs" repository, and the unstructured table extraction from PDF notebook is there. Let's get started.

Okay, I have already cloned the repository with all its folders and opened it in VS Code, so let's go through it. These are the same links that I mentioned in my previous video. First we need to install the necessary packages; this is all we need. After that is installed: this cell is just for warning control, these are the necessary imports, this one is only for the API steps, and this one is for the core unstructured package; these are the normal imports you can get from the docs. As for the load_ext watermark line: if you are new to this or haven't watched my previous video, I use the watermark package to show which versions of the packages I am using in the demo, so you know which versions to follow if something does not work.
The partition function is also a really good way to work with unstructured: if you want to know what it contains, you can just use the simple help function and it prints all the necessary information for you.

Now let's go and load the PDF. Our first step is implemented here: I'm going to use the same GPT4All PDF, the file name points to that PDF, and you just pass it into partition_pdf. That's it, you don't need anything else; just three lines of code do the trick. Once this is done, elements is a list of all the elements present in the pages of the PDF document. So here we are not just extracting the tables but all the elements present in this particular PDF; I'm showing you this before extracting the table information more precisely, just to show you what gets extracted by the two different approaches. As you can see, elements is a list, and there are 134 different elements in it. If you want to pick through what is inside, you can just go and look; I will open it in a text editor. Before showing that, note that only Title, UncategorizedText, and NarrativeText elements are being produced as of now, although the table content has been extracted too. So before going to the next step, I will show you what sort of information this method extracts from the table. In the text editor I can see all the elements. In the PDF there is only one table, and that is the table we want to extract. Let's see what was actually extracted: I copy a value from the table (Ctrl+C), go to the editor, press Ctrl+F, and paste it, and as you can see, it extracted that information already.
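As an aside, the three-line loading step described above looks roughly like this sketch; the file path data/gpt4all.pdf is my assumption about where the paper is saved:

```python
# Default partitioning of the PDF; every element type is extracted, not
# just tables. The path below is an assumed location for the GPT4All paper.
from unstructured.partition.pdf import partition_pdf

filename = "data/gpt4all.pdf"
elements = partition_pdf(filename=filename)

# elements is a list of Element objects (Title, NarrativeText, ...).
print(len(elements))                           # e.g. 134 elements for this PDF
print({type(el).__name__ for el in elements})  # the element types found
```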
That copied text comes out as UncategorizedText, but we want it extracted as a Table element; that is what I will show you with the next method. Here it is labelled UncategorizedText, and the text contains just one column of the table: if you go to the GPT4All paper, you can see it extracted just the Model column and put that into one element. If you scroll here you can see all the model names, with no table structure. Now there is another element; if I scroll down a little, it extracts other things we don't need to care about right now. If I scroll further, the next element is also UncategorizedText, and it holds the second column; scroll down more and you get the third column, and so on, with values like 73.4 and 74.8 that you can find in the same row of the table in the PDF. So when we use the default partition_pdf method, it does extract all the table content, no doubt about that, but each element contains just one column of it.

What is the drawback of this? Say in the future we build a RAG application on top of this PDF and ask a question about the table: it will not retrieve the answer, because the information of one table has been split across different elements. So this is not the right way to extract a table, and you need to be careful when extracting information from PDFs that contain tables.

So what is the solution? We need to use the hi_res strategy in order to extract information from tables. If you want to know more about strategies, I have provided the link: there are different strategies, auto, fast, hi_res, and ocr_only, and the documentation lists the document types along with the partition function, strategies, table support, and options for each.
From that table you can tell which strategy is supported where; I'm not going to go through it all. Before continuing, though: I actually spent around an hour just debugging why this was not working on my machine. When you use the hi_res strategy, there are some packages that need to be installed on your system, not just in the virtual environment. I did some research, found the necessary things to install, and have provided the links here; these are the steps to follow if you face any issue. If you don't want to deal with this, I will show you how to do it via the API: if you go through the Unstructured API, you don't need to worry about these conflicting packages. It also depends on the machine, Windows versus Linux versus Mac, so my suggestion would be to go with the API, but if you want to practice locally, I hope these two links will help you.

Okay, the first way is the simple auto approach: unstructured.partition.auto. You just import partition; there is no partition_pdf here, because the unstructured package itself chooses which partitioner to use. I call partition, give it the file name, and provide the strategy hi_res, plus the usual arguments, and from the result I just want the table elements. If I run this, it first warns that this function will be deprecated in the future, then logs that it is reading the PDF file data/gpt4all.pdf, and as you can see it detects the page elements one by one. By the way, behind the scenes it is using OCR with Tesseract to extract the content of that table; that is the reason there are many libraries that need to be installed on your machine as well as in the virtual environment itself. It was a little confusing and took me some time to install everything, but now you can see all the steps it runs, and the table information is extracted into one element. The tables variable now holds that information; let's see how many tables it extracted, because we had only one table in the PDF.

Before that, one more option: instead of the auto partition, you can call partition_pdf directly and set infer_table_structure=True. If you run this, it goes through the same steps, processing each page and using OCR with Tesseract, and extracts the information; just go with whichever one extracts the information best for you. For me they seem the same: it says "Processing entire page OCR with tesseract", and there is the information.
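The two local hi_res variants described above can be sketched as follows; this is only a sketch, and it assumes the PDF path and that the system dependencies (such as Poppler and Tesseract) are already installed:

```python
# Two equivalent ways to run the hi_res strategy locally; both rely on
# system packages such as poppler and tesseract being installed.
from unstructured.partition.auto import partition
from unstructured.partition.pdf import partition_pdf

# Way 1: let unstructured pick the partitioner from the file type.
elements = partition(filename="data/gpt4all.pdf", strategy="hi_res")

# Way 2: call the PDF partitioner directly and infer table structure.
elements = partition_pdf(
    filename="data/gpt4all.pdf",
    strategy="hi_res",
    infer_table_structure=True,
)

# With hi_res, the whole table lands in a single Table element.
tables = [el for el in elements if el.category == "Table"]
print(len(tables))  # one table in this PDF
```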
But how do we do this with the API? As I said before, with the API you don't need to worry about installing dependencies. They have a free Unstructured API: please go through the link and get an API key, which they will send to your email address; for the SaaS offering you need to pay. So I would recommend the SaaS if you are at enterprise level, and the free Unstructured API if you just want to try things out. The API key is already in my .env file in this format, so you can just put yours into the .env as well.

Now, I'm installing python-dotenv to use that API key, and here I'm loading it; you can see I'm just using the free API key. This is the UnstructuredClient that you use together with the API key. Then, opening the file in "rb" mode, we build the shared Files object we imported from unstructured_client, pass the files in, and set up the shared PartitionParameters; these are the different parameters you can play around with to get better output, but this configuration does the trick for us. I will just run this. It now extracts the information similarly to the previous run, but through the API, so it takes some time. Once it is done, we can filter on el.category == "Table" so we get only the table elements. Now it is finished: I can check the length of tables, and if I want to see only the text, tables[0].text provides all the information. You can see everything is extracted into a single element, so the structure of that particular table is preserved, and you can also inspect the metadata.

Now comes the most interesting part: utilizing the extracted data in the most efficient way. There is metadata on the element, and the interesting thing is that whatever you do, the metadata can give you an HTML version to play around with. To use the extracted data efficiently, it is helpful to have an HTML representation of the table, so that you can hand the information to an LLM while maintaining the table structure; it is the same content as the plain text. If I just go here, it is converted to HTML. Okay, table_html is not defined, so somewhere I didn't run a cell; I run this one and this one, and now you can see the table HTML. With this piece of code we can view what the HTML in the metadata field looks like: the information that was hard to read is now in that format, and we can render the HTML. If I run this, yes, this is the information extracted from that table. If you compare it with the table in the paper, it is exactly the same: there it is in table format, here it is HTML, cleanly extracted with all the information preserved.
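The API flow described above looks roughly like the sketch below. The unstructured_client interface has changed across releases, so treat the exact class and field names here as assumptions based on the version used in the video:

```python
# Partition the PDF through the hosted Unstructured API, then pull the
# HTML table representation out of the element metadata. Class and field
# names are assumptions; check them against your unstructured_client version.
import os

from dotenv import load_dotenv
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared

load_dotenv()  # expects UNSTRUCTURED_API_KEY in a .env file
client = UnstructuredClient(api_key_auth=os.environ["UNSTRUCTURED_API_KEY"])

with open("data/gpt4all.pdf", "rb") as f:
    files = shared.Files(content=f.read(), file_name="gpt4all.pdf")

req = shared.PartitionParameters(files=files, strategy="hi_res")
resp = client.general.partition(req)

# The API returns plain dicts; keep only the table elements.
tables = [el for el in resp.elements if el["type"] == "Table"]
print(tables[0]["text"])                           # the raw table text
table_html = tables[0]["metadata"]["text_as_html"]
print(table_html)                                  # structure-preserving HTML
```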
Now we can plug this into LangChain to summarize the table information using Llama 3 via Ollama. If you are new to Ollama, I have a playlist; please go through it first. You need to run Ollama on your local machine, and we are going to use Llama 3, so you need to pull that model. Just to show you in the terminal: if I run ollama list, you can see llama3:latest is already downloaded, and I have already started Ollama. If you have these two things, we are ready to go.

Now install langchain, langchain-core, and langchain-community; they are already installed here, but I will install them again just to show you. From langchain_community.chat_models we import ChatOllama (so we are not paying anything here), we import Document, and from langchain.chains.summarize we import load_summarize_chain. I will run this. By the way, the beauty of a notebook, if you haven't used one before, is that if you place two question marks after a class, you can see the code behind it. There you can see that by default it uses llama2, and that the base URL is localhost:11434, which is where Ollama listens; some of you asked in my previous video where Ollama is listening. If I open that URL, it should say "Ollama is running"; if it does not, you haven't started Ollama on your machine, and you need to start it first, because it has to be listening on port 11434.

Once that is done, I pass llama3 instead of llama2, call load_summarize_chain with the llm and the chain type "stuff", and invoke it on a document containing the table HTML. If I run this, it works through that HTML; once it completes, I can inspect the output. There is an output_text field, and if I print it, it reads: "It seems like you provided a table with various AI models' performance metrics, specifically in the area of language processing and generation", followed by an offer to analyze or summarize anything specific. So this is the output provided by the LLM.
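The summarization step described above can be sketched like this; the import paths match the LangChain versions shown in the video and may differ in newer releases, and table_html stands in for the HTML string pulled from the element metadata:

```python
# Summarise the extracted table HTML with Llama 3 via a locally running
# Ollama server (run `ollama pull llama3` and start Ollama first).
from langchain.chains.summarize import load_summarize_chain
from langchain_community.chat_models import ChatOllama
from langchain_core.documents import Document

table_html = "<table>...</table>"  # stand-in for the extracted table HTML

llm = ChatOllama(model="llama3")   # default base_url is http://localhost:11434
chain = load_summarize_chain(llm, chain_type="stuff")

output = chain.invoke({"input_documents": [Document(page_content=table_html)]})
print(output["output_text"])       # the model's summary of the table
```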
Now we have the table information extracted, and since it is in HTML format, it can easily be converted into a pandas DataFrame for further exploration. That is what I'm doing here: converting to pandas. Install pandas, then import pandas as pd and convert the HTML table to a DataFrame with pd.read_html, passing in the HTML. I call the result dfs because read_html returns a list, but we have just one table, so we take the first element of that list as df. If we print it, there is the information, now as a pandas DataFrame, and you can use pandas functionality such as df.shape and df.head; you can see this is a good-looking DataFrame extracted from that particular table.

I hope it is now clear how to extract table information, because many of you have been asking exactly that. I will explore the unstructured library further and create more educational videos in the future. If you are new to this channel, please do subscribe. Thank you for watching, and see you in the next video.
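For completeness, here is a sketch of that pandas conversion; the small inline table is a stand-in for the HTML taken from the element metadata, and read_html returns a list with one DataFrame per table found, which is why we index into it:

```python
# Convert an HTML table into a pandas DataFrame for further exploration.
from io import StringIO

import pandas as pd

# Stand-in for the HTML pulled from the element metadata in the notebook.
table_html = """
<table>
  <tr><th>Model</th><th>BoolQ</th><th>PIQA</th></tr>
  <tr><td>GPT4All-J</td><td>73.4</td><td>74.8</td></tr>
</table>
"""

dfs = pd.read_html(StringIO(table_html))  # a list of DataFrames
df = dfs[0]                               # we only have one table
print(df.shape)
print(df.head())
```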
Info
Channel: Data Science Basics
Views: 7,564
Keywords: unstructured.io, etl for llms, large language models, get your data rag ready, what is unstructured, unstructured api, langchain unstructured, clean pdf, python code unstrucured, unstructured data processing, unstructured.io api, unstructured.io tutorial, unstructured langchain, unstructured.io pdf, unstructured llm, unstructred.io excel, extract table data from pdf using unstructued, unstructured pdf, summarize using llama3, llama3, llama3 with ollama, ollama langchain
Id: hQu8WN8NuVg
Length: 18min 27sec (1107 seconds)
Published: Sun May 05 2024