Extract and Visualize Data from PDF Tables with PDFplumber in Python

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
howdy howdy howdy this is JW catter and I'm a journalist in Hampton Roads Virginia and this is a walk through on how to extract tables from PDF documents with PDF plumber in Python so just to start out as a disclaimer I am a decent journalist but I am by no means an expert at programming that's why I'm calling it a walk through instead of a tutorial however I do think that you know people especially journalists can benefit from watching someone who's not an expert at this do it because hopefully it will encourage you all to try it out for yourself you really don't need to know that much and it's not that hard um and you know if people point out what I could be doing better in the comments I'm totally fine with that because I'm sure there's stuff that I could be doing better so why would you want to extract tables from a PDF document the best way I can explain this is just by showing you the PDF I was working with on a story this is something I obtained from the Department of Behavioral Health and developmental services in Virginia I'm working on a or I just published a story about involuntary detention if you're interested in learning more I'll put a link to that in the description this was the data that I obtained and as you can see this is pretty common for things that you'll get out of public records requests and you'll see as we scroll down that there's new tables with new information that's not related some information is not included in certain tables and it is in others there are times when you would get a document from a public office and it would just be easier to you know copy like manually write it into a spreadsheet and then you can work with the data that way but for something like this it's sort of in the gray area of that but also you know if you're wanting to learn how to program this is a good opportunity to try and learn how to do some basic data analysis and visualization so as a journalist you know you want to sort of look at the document first before you get into programming to see if you can spot any trends and for me I noticed pretty quickly that if you can see there's loss of custody events in this table here and the total events this is back in 2015 if we scroll down uh you see that they stay pretty consistent for a few years but then once you get to closer to the present we see this big jump from almost almost a triple of the number so obviously I wanted to for my story show this trend that was sort of the center of the story or the the rest of the story so I wanted to use the data to create a visualization of that and PDF plumber was the tool that I found that would make it easy to do that I should mention that this particular PDF is already ocr'd meaning that the text is recognized so you know if you're interested in learning how to do that I'll put something in the description but also I'd be happy to make a video about that too I run Linux on my computer so this next stuff is going to be in the terminal might be a little scary if you've never seen terminal stuff before but I'm essentially just going through folders like you would in your file system on Windows or Mac it's the same thing so first of all I'm gonna go to my projects folder so next I'm going to create a virtual environment if you don't know what that is PF plumber is a module in Python and most programs in Python will use several modules and a virtual environment is a way where you can create a virtual environment where you can install certain modules that you only plan to use for that project so that you don't have to install them on your computer and it takes up all the space so to do that I'm going to type in Virtual environment we'll say PDF plum and then I'm going to open that folder and this is going to be different if you're in Windows or Mac so you know if you're you'll have to figure it out yourself basically if you're getting that but I'm going to type in this command which activates the environment and then I'm going to create a new folder PDF we'll call it PDF project open that and then we're going to create a python file and then we're going to open vs code vs code is the IDE which is an integrated development environment for python it doesn't really matter which one you use but that's the one that has been easiest for me so the first thing we are going to do is open a terminal in here and we are going to use pip to install PDF plumber there's other ways to install it which you can find with the PDF plumber documentation which I'll include in the description as well but yeah it's installed so we are going to import that you're going to import pandas because we're going to need some data frame stuff I guess I need to install that as well and finally we're going to import matplotlib and matplotlib is what's going to allow us to visualize the data once we've analyzed it okay so first of all we are going to open the PDF with PDF plumber so to do that you're going to want to make sure that you copy and paste the PDF that you're wanting to analyze into the same project folder that you created for the program literally do that just by copying and pasting with your file system which I've done so now you can see the PDF here so to open the PDF with PDF polymer we're going to use the open method which you can you know PDF is we're just setting a variable here and then we're calling pdfplumber dot open which is the method and then we're going to want to call the PDF I'm going to rename this to something that's less long and complicated so we'll say example PDF and we will type that in here we're going to want to put it in quotes oh my God okay so now it should be able to open it and now it starts now we have to start thinking a little bit more we have to start you know deciding what we want our graph to show so in my case you know as I showed earlier with the data I'm mainly interested in the number of events so you know if I I could just show it by year and that would be fine but a more interesting graph and since we already have the data I know anyway I want to show it by month and so we really need the month data and the event data those are the two main things and you can see this is just a mess up here so we have to sort of think creatively about how we're going to get this so the idea that I came up with was you know I'm just going to create a basic Python's list we'll say loc for loss of custody um which is the main data column that I'm interested in and then we are going to create two columns for that one is going to be month year the other one's going to be total events which that is not what they are called in the PDF but we will get to that okay this is the fun part this is where we're going to write a function that will extract the data from the tables in the PDF so in Python you say df4 to define the function we'll call it get total and then for the parameter we'll just put Doc we're going to enter this PDF but this is just a placeholder for now Okay so what we're going to want to do is create a empty list and in this list is where we're going to add each of the rows that we want to analyze so to make things simple and to move through the because we're working with multiple years of data we're going to create a year counter so even though the table is starting the first table is 2015 you'll notice that it's fiscal year 2015. and this creates some confusion for me when I was first programming this but so you'll notice that it starts this is actually July 2014 and then it changes to 2015 January here so we're going to have to accommodate that but so we're gonna have the year counter actually start at 2014. and then another list that we're going to want to create is the actual pages and so with uh python counting starts at zero so this first page is actually zero in the way that python looks at it so this first table is the actual and we want to start with which would actually be one in Python so I won't bore you with going through the actual PDF but you'll just have to trust me that these are the pages that I want so it's going to be one two three four five seven ten and thirteen okay and then we're gonna do another list and this is going to actually be using PDF plumber so we are going to use the list function and then we're taking doc which this is the PDF that we've opened with PDF plumber here and then we're going to use the PDF plumber Pages class to start pages and so what that is doing is converting the PDF page into a object that will be able to interact with easier in Python and the way that we're going to do that is we're going to use something called less comprehension so this is an index so like you know you can think of it like page numbers which is kind of confusing because the class is also called pages but we'll save a for Loop which is for I in select pages so here we're saying okay take this PDF that we're going to enter here and use PDF plumber to extract the pages from it and the pages that we want here's the page number we're going to say that basically each page that has an index in this list is what we want so basically we want page with index one which as I explained with python actually means page two and so on okay let's continue so now we're going to actually start to iterate or go through the pages now that we have the pages so we'll say four page we'll use another for Loop four page and Doc pages now remember how I said the fiscal year so that makes things a little complicated because we're iterating through each table but the the year number is going to change halfway through so the easiest way I thought of to deal with this is just to split the uh the table into two lists so we'll do the first half which will be the first part of the year so we'll say table first half and page and this is where we actually do the extract table and PDF plumber so we'll say extract table we'll say negative 13. to negative seven the reason why we're doing negative numbers here is because as you can see they're assignment this is really confusing up at top here it's not really clear where the rows start like is this one or two rows I don't know but we can just avoid that by counting backwards instead starting from the bottom so negative seven this is saying negative 13 to negative seven so if we go one two and with we talked about how python starts counting from zero that's not true for negative numbers so we'll do negative one two three four five six seven eight nine ten eleven twelve thirteen okay so starting with July and then negative one two three four five six seven and so we say to negative seven which won't be including January it'll just go up to January so it'll go it'll be what we want July to December okay okay and then we'll do the same thing for the second table or sorry we'll do the same thing for the second half of the table so second half this page dot extract and once again Paige is rendering through all the pages that we created here so extract table be careful here because there's a separate method in PDF plumber that is extract tables plural I just want to use the singular extract table because there's only one table that I want on the page you can look at PDF plumbers documentation which I'll link to and then for this one we are going to do negative seven so this time it will include seven and then because it's we're starting from there and then up to negative one so basically up until the last row but not including the last row because we don't want the totals okay so now remember how I said earlier that the table gets wacky at fiscal year 2017 and it randomly adds this extra column which is going to mess with how we're counting it so we're going to have to add a if statement here so if your counter and even though it's the fiscal year 2017 we're going to say if is less than 2016 because that's the actual year that it starts on July 2016. so now we're going to iterate through each so this is extracted basically it split this table into two so we've essentially got two tables we've got the the end of the 2016 year or if we're starting from the very beginning we've got the end of 2014 and then we've got the end or the beginning so we've got the end of 2014 and the beginning of 2015. That's what these two tables are so now we're going to iterate through those rows that we extracted so we're going to say four row and table we'll do the first table first and we're going to say total rows this is the empty list that we created earlier and we're going to say let's add so in Python you say append to add to empty list okay so this part might get a little confusing but just hang in there with me so basically the row that we want for the purposes of our graph we want to show the loss of custody events over time so we first want to pull the month so basically we're going to basically I'm going to take the month from the table we're going to add the year right so to make sure it's the right month is corrected to the right ear and that is going to be the x-axis of our data visualization that we create and then the loss of custody events will be the y-axis so that's the two bits of data but because the way that the data is listed in the table where it's just the month right so we're gonna have to add the month with the correct year with programming so the way that we're going to do that is the month is in the First Column of the row so we are going to say add row and then counting from zero this is the actual First Column the brackets mean it's in that's the index or the column in this case so then we are going to add a space for purposes of formatting and then we are going to do scr which basically means we're going to convert the year counter which is a number into a string in Python which basically means words and stuff right so we're going to say your counter yep and then so that's the that's the date right that's the first piece of information now we are going to add the loss of custody events okay so let's check that so remember it changes in 2017 so we want to start from the beginning so that's zero one two three four five six okay and so we're going to want to do the same thing for the second half so we can just copy and paste this make sure the indent is correct and then we want to change this to second half and then we're doing the same thing except remember the year is changing halfway through and so here we are going to do plus one okay and now so we're saying this is what you do we're telling the computer this is what you do for if the basically it is sort of convoluted way but we're saying you know if it if the year is less than 2016 do this if it's not in other words once we get to this table because the reason why this works is right when we're we're telling the computer okay your counter is 2014. go through this table and so we're going to have it iterate through the whole table both the first half and the second half and then at the end here we're going to create we can just go ahead and do this we're going to create a ear counter so we're going to say we want to make sure it's still in the for Loop but we'll say your counter plus equal one so in Python this is this is basically a way to say okay take this variable and assign it this is basically a shorthand for a year counter plus one but it like just saying equal like this so it'd be like saying why am I getting the red squiggly right here indented block okay I'm not sure what that means but hopefully it's just gonna go away but this is basically the plus equals it's a shorthand for this but we don't need to write all this out we can just write plus equals one okay okay so oh I think this is it's giving me the red squiggly line because it thinks I want it here but I don't actually but okay just uh Circle back to what I was saying so we're telling the computer okay go through this for Loop your counters 2014. okay now that we're done with this table let's add plus one to the air so now it's going to be 2015 which is correct because this table starts in July 2015. and then with this plus one we are making it so that it'll add the year at the appropriate time in January but then once we get to this table which is fiscal year 2017 but starts in July 2016. and by this time we'll have added to the year counter which is starting at 2014 twice so one two so we'll be at 2016 here and so that's when this is going to come into play because the computer will say oh the ear counter is not less than 2016. so what do we do in that case so we're gonna do the exact same thing and I this is clearly like really uh not not elegant code I would say there's probably a easier way to do this or or a more concise way but this is the solution I came up with and it works so we're going to do the same thing but instead of six we're going to change this to seven because of that additional column so let's just go back and double check our work so zero one two three four five six seven and that's correct okay and yep sure enough the red squiggly Line's gone away now and so now we're going to do return total rows and we can check always good to check and make sure things are working before you continue so we can do a test variable and we'll say gets total PDF and then we can print it let's see what happens okay we got an error you show this usually happens foreign oh right okay I see what happened so we need to actually you can only append one thing is what it's saying and we're wanting to basically enter in coordinates for the graph into this list but in order to do that we need to basically make we're basically adding lists to a bigger list so we want to the way we do that is putting brackets around here that way we are only adding technically one thing even though it's a pair of coordinates basically oops so do that okay let's try that again okay great so now let's check and make sure that this data is matching up I'm not going to go through all of them but let's see so July zero that's correct August to September 1 October 7 November 4. okay that's looking good let's go towards the end make sure that looks correct fiscal year 2022 [Music] okay so let's see June 2022 uh wait this is the wrong table okay June 2022 total nine yes that's correct 12 yes nine nine nine yep 16. okay yeah that's looking good all right so now we're going to write another function and this is basically the problem with the way that this is right now is one thing these numbers are strings so we want to convert those to things that python will recognize as numbers and then we want to convert the these strings into something that python is going to more easily understand to be the date and it's basically a way of making it so that we can more easily graph it with matplotlib which is another python module okay so let's get rid of this we're gonna do Define we'll call it clean clean data okay so it's going to take the rows and our columns which this is sort of misleading so the rows we're adding from here but really it's it's this information right so it's these date and number of pairs and then for the columns we just sort of artificially created these but these are referring these are basically in place of what was in the PDF right okay so so to do this we are going this is where we're going to start to use pandas so we're going to create a data frame DF for data frame PD short for pandas dot because we're using a method capital D it it won't work if you don't capitalize D and F in data frame and then this is basically saying okay we want to make what we're entering into here into a data frame making into a data frame will make it easier to visualize the way that the data frame thing works is you have to say columns equals calls okay now to reference columns and a data frame use brackets I don't really know why I'm explaining this because it's sort of an insufficient explanation for people who don't understand it and then for people who do understand it's totally useless whatever okay so total events and we're basically going to remember what I said we're going to change these numbers into Strings so we'll say it's equal to Total events and we're going to use the apply method which is basically going to say okay use this pandas method which is too numeric sort of converting to a numeric and do it to every Row in this column that's basically all that's saying and then for the month of the year we're going to format that using another pandas method to date time month year okay sorry let me clear out this terminal okay and then we're going to format now this it depending on how you want to do this you'll have to look it up yourself how you want to it's going to depend on the way that you have the data like you saw earlier this is a string where it's like month year and so I looked that up in the way that you do that is percent sign b space percent sign y don't know why it's a b but it works okay and then now we are going to you don't really have to do this but I just did this to sort of make it easier to visualize we're going to make the dates the index for the data frame set index date month year yep and then when you do that when I was programming this earlier I found that it was jumbling up all the dates when I did that and so we want to sort it by the year so the way that we can do that is DF sword index okay otherwise you'll get a weird looking graph with a bunch of zigzags okay so let's make sure that works test okay well we have to let's let's do create table oops let me do it so create table and for the rows we are going to call function inside the function so we'll do get total for the PDF and then comma because remember Rose columns so for the columns we didn't block columns okay oh create table clean table okay table clean data there we go okay we'll print the test okay so you can see how it's taken what was essentially a coordinates of two strings now we have an actual format and the reason why it's printing twice is just because this is the index but then these are still matching up these are still the correct dates okay okay so let's visualize this thing so I'm really gonna uh go quickly over this like I said this is walkthrough not a tutorial but just to show you sort of how it's pretty easy so PLT is referring to matplotlib here PLT Pi plot and with DOT style.use that basically is I guess you could call them templates of you know just colors and the way the graph looks so I looked at those and I you can Google those and I like Seaborn dark the best and then we are going to tell matplotlib you know okay well okay so first delete the variable but we'll call it loss of custody data frame will do clean clean data get total PDF what we did before columns so we'll tell it we want it to plot DF specifically the total events and we only have to tell it this column because it will assume that the index is the x-axis and then this is just some labeling stuff so we're going to say events per month set the font size okay and we'll give it a title positive custody events with the font size add some padding oops add a comma padding now this next part is going to look really confusing but this is something I just looked up as a way because when I was first plotting it it was listing all of the months but it was squishing them all together to where it was illegible so I looked up how to just you know format it so it was just the year and this is only possible because we did this function right which set it to daytime and set it to this format but now that we've done that we can do something where it says GCF Auto format X date okay and then this is just having to do with the layout again adding some padding okay and then we should be able to just do plot.show we should get a graph fingers crossed there we go let me get this thing bigger yeah so there you can see now if you know like I talked about earlier if you had just gone through the PDF and picked out the totals for each year you would have had a more boring graph with less data points but since we had all of the data for each month why not do this and make a more interesting graph and there's the labels right there and that is it you can see that it's a pretty clear Trend upward which is what my story is based on so I hope that that concludes this walkthrough hope you got something out of it uh like I said at the beginning if you have suggestions about how I could I think the technical term is refactor my code or make it more elegant make it more efficient please let me know in the comments if you have any questions let me know if you have any suggestions about well or what other videos you'd like me to make let me know other than that have a good rest of your day
Info
Channel: JW Caterine
Views: 10,154
Rating: undefined out of 5
Keywords: walkthrough, python, journalism, data analysis, pdf, tutorial, data, data visualization
Id: x9IDL8eruAw
Channel Id: undefined
Length: 39min 17sec (2357 seconds)
Published: Mon Jun 26 2023
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.