How to Extract Tables from PDF using Python

Captions
[Music]

Hello everyone, and welcome to my channel. In this tutorial we will discuss how to extract tables from PDF files using Python.

When reading research papers or working through technical guides, we often obtain them in PDF format. They carry a lot of useful information, and the reader may be particularly interested in some tables with data sets, or in the findings and results of research papers. However, we all face the difficulty of easily extracting those tables to Excel or to data frames. Thanks to Python and some of its amazing libraries, you can now extract these tables with a few lines of code.

To continue following this tutorial we will need the following Python library: tabula-py. If you don't have it installed, please open Command Prompt if you're using Windows, or Terminal on Mac, and install it using the following code. Please note that tabula-py is a Python wrapper for tabula-java, so you will need Java installed on your computer in order to continue following this tutorial in Python. I also provided a link below this video where you can download and install Java on your computer.

Now that we have the requirements installed, we'll need a sample PDF file from which we will be extracting the tables. The file I'll be using is solely for the code examples, and you can access it by this URL, which I also provide below this video. I downloaded this file and saved it in the same directory as our main.py as sample.pdf. However, for this tutorial it really doesn't matter, because the library allows us to access the PDF file either using a URL or by reading it from a directory.

So let's take a look at this file. It's a two-pager that has a few tables which we're particularly interested in. There's one here on page one, another one on page two, and one more on page two. In this tutorial we'll discuss how to extract specifically these tables from this PDF file.

Now let's dive into the code. In this section we will work with the file mentioned in the previous
section. If you took a look at it, you could see that it has a total of three tables on two pages: there's one table on page one, which is this one, and there are two tables on page two, right here.

Now suppose that you're interested in retrieving this table specifically from page one. We know that it is on the first page of the PDF file, and now we can extract it to CSV or to a data frame using Python.

The first method that we're going to look at requires us to import the library and then specify the path to the PDF file. Note that there are two ways of doing it. I can either use sample.pdf, since I already have it saved in the folder, or, if you don't have it downloaded, you can have pdf_path specified as the URL I mentioned earlier, and I've also provided a link to it. It doesn't really matter which of these you use, so I'm just going to use the path to the file in my folder.

Next we will have this variable dfs, and it's going to be a list of data frames; I'll explain shortly how exactly it works. To read the tables we will call tabula.read_pdf. We will need to provide the path to the file and specify which pages we would like it to inspect. We know that the table is located on page one, so we specify pages equals one. What the above code does is read the first page of the PDF file searching for tables, and append each table as a data frame to a list of data frames, dfs.

We can check how our code is doing by printing the length of dfs. We know that there's only one table which we're interested in, so the length here should be one. Let's see if that actually works. Perfect: we see the output here, so there's only one data frame. Now if you want to take a look at what it looks like, we can print the first entry in the list and we should get exactly the table that we're looking for. And here it is, exactly the table that
we wanted. Lastly, the step would be to write this out to a CSV file. We will locate the data frame and call to_csv, and let's call the file first_table.csv. Let's run the code. We see that it completed, and then we see this first_table.csv file right here, with exactly the entries we were interested in.

That was the first method. The second method allows us to do basically the same thing, but with far fewer steps. We still need the path to the file, but we can call another method here, called convert_into. Instead of running several lines of code, this allows us to do everything at once. We give it the path; next we need to specify the name of the file that we're writing, and let's call this first_table_2.csv. For the second method we also need to specify the output format, and the output format is going to be CSV. Lastly, similar to the previous method, we need to tell it the pages that we would be inspecting, and that's also page one. Let's run the code, and we see that the new file has been generated right here. These two CSV files are identical.

So, great functionality, two methods. But it's important to note that both of the above methods are easy to use when you're sure that there is only one table on a particular page. If we look back here, we knew right away that on page one there is only one table. Now what happens if there are actually two tables? This might create some complications for the current logic, so in the next section we will explore how to adjust the code when working with multiple tables.

Recall that the PDF file has two tables on page two: a larger table here and a smaller table underneath it, and these are the tables we would like to extract. Using method one from the previous section, we can extract each table as a data frame and have a list of these data frames. We already have the PDF path;
now let's do the same thing as we did before: tabula.read_pdf with pdf_path, and now we tell it to only look at page two. Note that there should be only two data frames in the list, so the length of dfs should be two. Let's check that. Okay, the code has completed and the length is two, so we have two data frames: each table is a data frame stored in the list.

Now if we want to write them to CSV, we'll simply need a for loop: for i in range(len(dfs)), take dfs[i]. We'll iterate through each entry in the list and save it as a separate CSV, and let's call it page two table i, and then .csv. Let's run the code. Perfect: we see two CSV files extracted here, and they are the two tables we're interested in. Note that if you try to use method two described in the previous section, it will write both tables into a single CSV file, and you would need to split it into two files manually.

In the previous sections we focused on extracting tables from a given single page of the PDF file. Now what do we do if we simply want to get all the tables from all of the pages in this PDF file and save them as separate CSV files? Keep in mind it's relatively easy to go page by page when you only have two pages and you know exactly what you're looking for. But if you're working with a research paper that has a hundred pages and 20 or more tables, there needs to be a better way. Thanks to tabula-py, this is very easy to implement even if you have several pages with several tables. All we need to do is tell the method to look at all pages instead of only one page, and let's call our new files all pages table i. After running this code, the goal is to have all the tables saved as their own CSV files. We know that there are three tables, so the result we're expecting is three CSV files. Let's take a look. Perfect, here are our files: the first table, the second table,
and the third table.

In this tutorial we discussed how to extract tables from PDF files using the tabula-py library. If you're enjoying my videos, please subscribe to the channel and hit the bell button to be the first to know when the next video gets uploaded. Feel free to leave comments below if you have any suggestions or ideas for future videos you would like me to make. Also check out my blog for complete code walkthroughs; I've provided the link below. And stay tuned for more Python programming tutorials.
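
The on-screen code is not part of the captions, so the first method described above (read the tables on one page into a list of data frames, then write the first one to CSV) can only be reconstructed. A minimal sketch follows, assuming tabula-py is installed (typically with `pip install tabula-py`) and Java is available. The helper names `read_tables` and `save_first_table` are hypothetical; the file names `sample.pdf` and `first_table.csv` follow the spoken description.

```python
def read_tables(pdf_path, pages=1):
    """Return the tables found on the given page(s) as a list of DataFrames."""
    # Imported inside the function so this file still loads on a machine
    # where tabula-py or Java has not been installed yet.
    import tabula

    # read_pdf scans the requested pages and returns one DataFrame per table.
    return tabula.read_pdf(pdf_path, pages=pages)


def save_first_table(dfs, csv_path="first_table.csv"):
    """Write the first DataFrame in dfs to csv_path and return the path."""
    if not dfs:
        raise ValueError("no tables were found on the requested page")
    dfs[0].to_csv(csv_path)
    return csv_path


# Example usage (needs tabula-py, Java, and sample.pdf next to the script):
# dfs = read_tables("sample.pdf", pages=1)
# print(len(dfs))   # number of tables detected on page one
# print(dfs[0])     # the first table as a DataFrame
# save_first_table(dfs)
```

The `pdf_path` argument can be a local path or a URL, as the captions mention.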
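
The second method, a single call to `convert_into`, might be sketched as below. This is an illustration under assumptions, not the code shown in the video: the wrapper `convert_page_to_csv` and its `converter` parameter are hypothetical, and the parameter exists only so the function can be exercised without Java. The arguments passed to `tabula.convert_into` (input path, output path, `output_format`, `pages`) match what the captions describe.

```python
def convert_page_to_csv(pdf_path, csv_path, pages=1, converter=None):
    """Convert the tables on the given page(s) straight to a CSV file."""
    if converter is None:
        # Imported lazily; tabula-py needs Java on the machine to run.
        import tabula

        converter = tabula.convert_into

    # One call reads the PDF and writes the CSV, with no DataFrame in between.
    converter(pdf_path, csv_path, output_format="csv", pages=pages)
    return csv_path


# Example usage (needs tabula-py, Java, and sample.pdf next to the script):
# convert_page_to_csv("sample.pdf", "first_table_2.csv", pages=1)
```

As noted in the captions, this approach suits pages with a single table, because every table found on the requested pages ends up in the same output file.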
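
For pages with several tables, and for the whole document, the captions describe looping over the list of data frames and saving each one separately. A hedged sketch of that loop is given below. The helper `save_tables` is a hypothetical name, and the exact file-name pattern (`page_2_table_0.csv` and so on) is an assumption, since the captions only say "page two table i" and "all pages table i".

```python
def save_tables(dfs, prefix):
    """Write each DataFrame in dfs to '<prefix>_<i>.csv' and return the paths."""
    paths = []
    for i, df in enumerate(dfs):
        path = f"{prefix}_{i}.csv"  # numbering starts at 0, like range(len(dfs))
        df.to_csv(path)
        paths.append(path)
    return paths


# Example usage (needs tabula-py, Java, and sample.pdf next to the script):
# import tabula
# dfs = tabula.read_pdf("sample.pdf", pages=2)       # tables on page two only
# save_tables(dfs, "page_2_table")
# dfs = tabula.read_pdf("sample.pdf", pages="all")   # tables on every page
# save_tables(dfs, "all_pages_table")
```

Passing `pages="all"` is the only change needed to go from one page to the whole file; the saving loop stays the same.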
Info
Channel: Misha Sv
Views: 61,270
Keywords: python extract table from pdf, python extract data from pdf, python extract text from pdf, python tabula, python tabulate, python pypdf2 example, python tabula multiple tables, python tabula example, python camelot, python tutorial for advanced, python table extract, python table pdf json, python read remote pdf, python extract data from pdf file, python, tabula, camelot, pypdf2, pdf, programming
Id: tEFAFQXaOWw
Length: 14min 7sec (847 seconds)
Published: Sun Oct 17 2021