Best Way to Extract Tables from PDF with LLMs

Video Statistics and Information

Captions

Without proper data, there is no use in having a generative AI model or application on your systems. If you want real value out of your generative AI applications and models, you need to make sure that your data is of the highest quality and that it is processed properly. LlamaIndex is one such framework, or set of tools, that enables you to process your data and make it AI-enabled.

LlamaIndex has recently released a new version, plus another cool tool, a PDF parser called LlamaParse. It is still in public preview and there is no source code available for it; the only thing we have is an API call. The whole idea behind LlamaParse is that you give it your PDF document and it processes that document, and what makes it different from other PDF parsers is that it extracts tables and figures quite nicely. In my own experiments I have found that, yes, it deals with tabular data quite nicely, but when it comes to diagrams and figures it still lags behind. Even if it only handles tabular data properly, I'm good with that, because I have found that a lot of tools on the market are good at PDF extraction but struggle when it comes to table data. LlamaParse has shown a lot of promise.

As I said, it is just an API, so you cannot install LlamaParse itself locally. But you can install LlamaIndex locally and then make an API call, and before you make an API call you also need to set the API key for LlamaIndex Cloud. Yes, we have another cloud and another API. To do that, go to the LlamaIndex Cloud site.
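
The key setup mentioned above can be sketched roughly as follows. The helper prefers a key passed in explicitly and otherwise falls back to an environment variable. The variable name `LLAMA_CLOUD_API_KEY` is the one the LlamaParse documentation uses; the function name `resolve_api_key` and its error message are made up for illustration and are not part of any library.

```python
import os

# Environment variable name used by the LlamaParse documentation.
ENV_VAR = "LLAMA_CLOUD_API_KEY"


def resolve_api_key(explicit_key=None, env=None):
    """Return the API key to use, preferring an explicitly passed value.

    Hypothetical helper: `env` defaults to os.environ and exists only so
    the lookup can be exercised without touching the real environment.
    """
    env = os.environ if env is None else env
    key = explicit_key or env.get(ENV_VAR)
    if not key:
        raise RuntimeError(
            f"No API key found: pass one explicitly or set {ENV_VAR}"
        )
    return key
```

Passing the key explicitly is the safer of the two paths in a notebook, where a stored secret is not necessarily exported as an environment variable.
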
I will drop the link in the video description too. Once you are there, sign up with your Google account or whatever email you use, and log in. Once you have logged in, scroll down on the front page, click on API Key at the bottom left, and generate your new key from there. I have already generated my key here; you can delete a key and then create another one. That is all you need from LlamaIndex Cloud.

For the purpose of this video, I will use just one PDF file with one table. We will install LlamaIndex along with LlamaParse, and then I will quickly show you how you can parse PDF files with this new LlamaIndex tool. I'll be using Google Colab for that, but if you have Python installed anywhere, you can do the same from there.

Okay, let's go to Google Colab. This is my Google Colab. Let's first change the runtime type to T4 GPU, which is free, thanks to Google. Let's install LlamaIndex and LlamaParse. Let me run it; it will take a few seconds, I guess, since it's not that huge. So let's wait for it to get installed, and as you can see, it is getting installed now.

While it installs, on the left-hand side you can see the key icon. I have already set my Llama Cloud API key and my OpenAI API key there, because that is what I will be using in my Google Colab. Make sure that you get both of these keys and store them in the secrets: just click on Add new secret and add them there. Okay, so I have them in my secrets. Now let me click on + Code and retrieve both of these keys. It asks me to grant access, and I'm saving both of them in variables. Yes, grant access. That is great.

Now let's set up our asynchronous code. These two lines are relevant for notebook environments, whether Google Colab or any other Jupyter notebook. When you are working with asynchronous code in a notebook, you might encounter issues related to running an async task in a loop that is already running. This is because the default asyncio event loop doesn't allow nested event loops, so you cannot run a new event loop if there is already one running in the same thread. This is a common issue when using libraries that need to run async functions, and these lines help to avoid such race-condition sorts of problems. I have already run it.

Once that's done, we need to initialize LlamaParse, and we import it from the module we installed. All we are doing is importing LlamaParse, setting the result type to markdown, because that is what it supports, and setting the language to English. Let's run it; it shouldn't take too long.

Okay, so what is that? It is the API key, which I already have. I think it needs to pick up the API key that I have already set. Maybe I have put in the API key name wrong; let me check quickly. What I have done here is pass in the Llama Cloud API key that I stored as a secret. I have added this line, which I shouldn't need to, going by the documentation. Maybe I could show you: this is the documentation, and it says that you could set the key in your environment, which I already did. Anyway, let's run it and see if it works. Okay, so this time it worked. So not only do you need to save the secret, you also have to assign the secret key. Okay, that is fine.

Now let's specify our PDF file. I have already uploaded it here: if I click on this folder icon, you can see I have uploaded table.pdf. If you want to upload a file from your local system, just click on this icon and upload the PDF file. Let me quickly show you the PDF file too. It has one table with some headers, columns, and rows, and this is what I'm going to parse with LlamaParse. I'm using the same parser that we just initialized, and I'm going to load the data. Let me run it. It has started the parsing job and it finished quickly, so the speed is quite good. That is another thing I have noted, because with a lot of PDF parsers the speed is not good, but with this one it is very fast.

Now let's print out what exactly it has extracted. I'm just going to take document zero and get the rows. There you go. You can see that it has identified the structure: there is the disability category, and then these are the rows. If I show you the table again, you can see that there is the disability header, participants, and ballots completed, and in the first column we have blind and low vision. If I go back here, you can see the same, and you can see the data too. It is also separated in the markdown, so it has differentiated between the header and the data.

So this is how easy it is to extract tables from your PDF file. Of course, if you have any text, it is going to extract that as well, and then you can do a lot of things with this data: you can put it in your retrieval engine or query engine and go from there. I will drop the link to this cloud so you can use it. It is still in public preview, so let me know your thoughts on it. I give it full marks for tabular data, plus speed.

That's it, guys. I hope you enjoyed it. If you like the content, please consider subscribing to the channel, and if you're already subscribed, please share it with your network, as it helps a lot. Thanks for watching.
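
The steps walked through in the captions can be sketched roughly as below. This is an outline under stated assumptions, not the exact notebook from the video: it assumes the packages were installed with `pip install llama-index llama-parse nest_asyncio`, that the "two lines" for notebooks are `import nest_asyncio` and `nest_asyncio.apply()`, and that the key is available either as an argument or in the `LLAMA_CLOUD_API_KEY` environment variable. The wrapper name `parse_pdf_to_markdown` is hypothetical.

```python
import os
from pathlib import Path


def parse_pdf_to_markdown(pdf_path, api_key=None):
    """Send one PDF to LlamaParse and return the first document's markdown.

    Hypothetical wrapper for illustration. Requires network access and a
    valid LlamaCloud API key to actually run the parsing job.
    """
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No PDF found at {path}")

    # Imported lazily so the file check above works without the packages.
    import nest_asyncio
    from llama_parse import LlamaParse

    # Notebooks already run an event loop; this patch allows nested loops.
    nest_asyncio.apply()

    parser = LlamaParse(
        # Pass the key explicitly, as the video ended up doing.
        api_key=api_key or os.environ["LLAMA_CLOUD_API_KEY"],
        result_type="markdown",  # the video requests markdown output
        language="en",
    )
    documents = parser.load_data(str(path))  # starts the remote parsing job
    return documents[0].text


# Example call (needs a real key and file, so it is left commented out):
# print(parse_pdf_to_markdown("table.pdf", api_key="<your key>"))
```

In Colab, the stored secret could be read with `google.colab.userdata.get("LLAMA_CLOUD_API_KEY")` and passed in as `api_key`, which matches the observation in the video that saving the secret alone was not enough.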

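Since the result separates the header from the data in markdown, one way to use it downstream is to turn the table into rows. The sketch below assumes the output is a standard pipe table with a separator row; the function name `parse_markdown_table` is hypothetical, the column names follow what is read out in the captions, and the numeric values are placeholders, not the figures from the video's PDF.

```python
def parse_markdown_table(markdown):
    """Turn the first pipe table in a markdown string into a list of dicts."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            if rows:
                break  # the first table has ended
            continue  # still before the table
        rows.append([cell.strip() for cell in line.strip("|").split("|")])

    if len(rows) < 2:
        return []

    header, body = rows[0], rows[1:]
    # Drop the separator row, whose cells contain only dashes and colons.
    body = [
        row for row in body
        if not all(cell and set(cell) <= set("-: ") for cell in row)
    ]
    return [dict(zip(header, row)) for row in body]


# Placeholder table: headers as named in the captions, values invented.
sample = """
Some text before the table.

| Disability Category | Participants | Ballots Completed |
|---|---|---|
| Blind | 1 | 1 |
| Low Vision | 2 | 2 |
"""

records = parse_markdown_table(sample)
print(records[0]["Disability Category"])
```

Rows in this form can be loaded into a DataFrame or passed to a retrieval or query engine, as suggested at the end of the video.
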
Info

Channel: Fahd Mirza

Views: 668

Id: HJI8k2Rl7ps

Length: 8min 6sec (486 seconds)

Published: Mon Mar 11 2024