Extracting data from PDF files using Python

Captions
Welcome back. I want to look at PDF files, and I want to write a Python script that can obtain information from these PDF files and return the total number of times a search term appears in a document. I also want to obtain the page numbers where a particular search term appears. In practice, this is a very frequent problem, which you encounter when you want to construct measures of good or bad corporate governance. As an example, I have here a financial statement which was published by Marks and Spencer, a retail company, in 2019.

Now, when you have a PDF document, of course what you can do is manually collect information. We could, for instance, look at the term "audit", and we would be able to find the number of occurrences just by doing a manual search using Ctrl+F in the PDF document. That is maybe an acceptable process if you have a handful of PDF documents. However, if you have hundreds of them, maybe thousands of them, it becomes really very cumbersome. So can we use Python to make this a lot more fun? The idea is we want Python to do this work for us: please open a PDF file for us, look for certain terms, count the number of occurrences. We want to have all the occurrences in the document, and we want to know on which page the term appears. So that's the task. Let's get into it.

So how do we fix it? Let me first talk about my Python setup, and let me close the PDF file. In my case, I run Python in Sublime, which is my text editor, to be precise, and I execute using the command line. You might have a different setup. For instance, you might use Anaconda, and then Spyder or Jupyter as IDE, which is perfectly fine. The thing we have to do, if you haven't done so, is install a package which we need to work with, which is PyPDF2. So that's the one we have to install. Either you have to do it through Anaconda, or you have to use the
pip installer. If you have any questions about how to install PyPDF2, just leave a comment below and I will get back to you, but you will find all this quite easily in the online documentation. Before we go any further, I just want to note that all the code and all the material is available on GitHub, and I leave a link to the GitHub folder for this particular problem down below in the description.

Okay, now let's get into this in a bit more detail. Let me just open up my command prompt. In here I just type in cmd, and I'm in the right location. At the moment I don't have any virtual environment activated. Personally, when you do some professional work, I would strongly advise you to do this in a virtual environment, in particular if you have to install some packages. That can sometimes get a little bit tricky, because some of them have dependencies and so on. To get a virtual environment, what I have done is I just put in virtualenv, for virtual environment, and then I named it pdf, and I executed that. I have already created that. If you haven't done this, I would recommend you create a virtual environment using the virtualenv command, and you can name it. If you haven't installed virtualenv yet, again you have to do this using the pip installer. You don't have to have a virtual environment; it's not really needed. It's just a lot safer, to be honest, and it's much easier later to obtain requirements and to make sure that someone else using your code knows what they need to do to run your code.

So I have already done this, and you see now in my folder that we have a subfolder pdf, and in it you see the subfolder Scripts. The only thing I have to do now is activate my virtual environment, which I do as follows: I refer to my current directory, then I go into pdf, I go into Scripts, and then I go into activate. This will
look a bit different if you are on a Mac operating system. So now it's activated, and you see the name of my virtual environment popping up inside the brackets, and I'm ready to go.

Okay, good. Now I move back to Sublime. I import PyPDF2. I will also import re already. re is an amazing module for matching of strings; it contains many, many useful functions. re, I believe, refers to regular expression syntax. Again, you can correct me if I'm wrong down in the comments. I'm usually wrong most of the time, but I think it refers to regular expressions. There's a lot to say about it. We don't really have the time here, but if you want to know more, just drop a comment.

Now, first things first: I want to assign my file. Of course, this could be not just one file; it could be many files you have in a folder, and then you have to modify the way you code a bit. In my case I just have one sample file. Let's call this file_name, using lower snake case, and then I refer to the file name, which I believe is ms 2019, and then pdf is the extension. That's ms for Marks and Spencer, and 2019 is the year.

Then you could start a document. doc is just the name I use for the object; it's a document object. Basically, what I now do is refer to my PyPDF2 module, take this file, and assign the result to doc, and this gives me my PDF. This is using PdfFileReader, and I have to pass in the file name. It might be useful just to check what is going on here, just to do a type function on this object, because we don't really know it yet. Of course, what you can always do is check the online documentation. Let me just run this now: python, and then my .py file. Oops, not PyPDF. Funny. Here we go, and you see popping up the class PyPDF2.pdf.PdfFileReader. Okay, so it seems to have worked up to this point. Good. So
what can we do with it? Of course, now you really have to go into the documentation, which I have done, to know what you can do on this doc object. So what do we need? First, what we would like to have is the number of pages, because that's helpful if you want to write a loop: we have to know how many pages there are. So let's do that. I just call this pages, and I refer to doc, use the dot operator, and you already see the first one here, it's getNumPages, and I just open and close brackets, so I execute this method. That should give me the number of pages. Again, it might be useful to do a little print and check whether it's all still up and running as expected. So we run it one more time, and we get 65. Okay, that's promising. Now we have the total number of pages, and I just remove the print.

The next thing I would like is to specify the search term. In here I call this search, which might not be the best name; it might also be a function, you never know. Let's call this search, or maybe search_term might be better. Let's leave it there and see what happens. And I go for, say, "independent". My spelling is always terrible. It doesn't matter what language I use, my spelling is always awful, and the best thing is just to use numbers. Okay, so here we are. Now what we want to do is look for the term "independent" in this PDF file, and of course you should use your own file for that, and see whether we can actually find it.

Now, what do we want to get out of it? I want to have something in return. When I execute my script, I want to have the total number of occurrences, and I would also like to know how many times the search term appears on each page, and I would like to have the page number as well. So how can I package this? I could imagine returning a list. I always like lists;
they are easy to work with. So we can have a list. We can of course have other lists inside the list, but personally I quite like to have a tuple in between, so a list of tuples, because I don't want to change the structure of a tuple. Now, what does this tuple do? It gives me all occurrences, and it also refers to the respective page number. This is basically what the list of tuples does, and I start building it up here. I just say list_pages, and I start with an empty list, so open and close square brackets. Later I want to put in the tuples, which give me all occurrences on the respective page. So that's the idea.

Now I want to go through this document. How do I do it? I use a for loop; that's the obvious thing to do. I call the respective page i, so for i in range, and this starts running from zero and goes up to, but not including, the number of pages, so it's pages here, and we shouldn't forget the colon. I go into indentation, I'm inside the block, and now I have to check the documentation again and see exactly what I need to extract the text from the PDF. What works in this case is you have to do this page by page, and this is why we loop over the pages. What we need to do first is obtain an object based on the current page, and I just call this current_page. So it basically takes the whole document, and page by page we throw the information into the current page object, then on that object we try to extract the text, and then we do some matching. So that's the idea.

How do we do it? We refer to our doc, the document which we have already created, and I now want to focus on a particular page. I just use getPage and refer to that page, using my index i here. So I go through all the pages up to the maximum number of pages, and I
put the information from the document into my current page. This is what you have to have in mind. Then I obtain the text from my current page using the dot operator, and here we use extractText. Again, you have to check the documentation to know these things, because you wouldn't know them unless you actually wrote this yourself. So that's a method which we call on current_page, and that should actually do it. Now we have the text, hopefully.

We could just try, for the fun of it, how this looks. I have no idea; I haven't really tried this before. I don't want to do it for all the pages, just for a couple of pages, so we change the range, just for testing, and see what happens. Yes, this is what you get: you do indeed get the text popping up, so it's reading through this document. That looks promising, and that's just for the first pages. So we basically get text out of it, and we can do some string matching on it. It's always nice to use some print functions in between, just to see what is happening, because, to be honest, in Python, because it's dynamically typed, you never know what object you get. You should always check, and in particular if you work with modules you haven't worked with before, it's always good to check the documentation and do a few print statements.

Good. Now I want to use the re package to do some string matching. This is re, and you use findall. findall is exactly what we're looking for, because we want to have all occurrences, not only the first occurrence. Again, it depends on your problem and on what you're looking for. Usually, for my type of work, it's useful to have all occurrences,
so I use findall, and in the arguments I have to refer to the search term, which is search, and to where I look, which is text. So that's it. Now, and again this is in the documentation, this will only return matches if it finds something; if it doesn't find anything, it simply returns an empty list. So if we want to look only at the cases where we actually find the word at least once, we can do this in an if statement. We will only execute the block if the term is found; otherwise it will not go into this if statement.

Maybe do another print first, just to check how it looks. Let's do that. I just copy and paste that and do a print statement. Again, the logic is: we go through all the pages up to the maximum number of pages, we take the page from our complete document and pass it on to our current page object, and then we extract the text from this object, which we already know is text, so that should be fine for string matching. Then we use findall, which comes from the re module and gives me all the occurrences, and the if statement only shows me something if the term is found. Let's have a look at how that looks.

Okay, that looks promising. What you see is you get a list, and this list has only one element here, and sometimes it might have more; here it has two elements. So you get a list, and that also means that if I use the len function here, I get the number of elements contained in that particular list, and this gives me the number of occurrences on the page, which is the first thing I would like to have. So in here I can simply do the following: get rid of print and use len, the len function, which gives me the number of elements contained, and this is the number of my counts. So I just use count_page
here; count_on_page might be nicer to write, and you can modify this, of course. This gives me the number of occurrences on a particular page, which is exactly what I want. Then I want to modify my list_pages, the empty list, by appending the information. Again, I want to have the structure you see above: all occurrences, which is simply count_page, and then the second element in this tuple is the page number. I have to be careful that I also put enough brackets in, because we have a tuple, and the page number here is i, the index I use in this for loop. So this should give me an appended list_pages, which contains the tuples that provide the number of occurrences and the respective page. Oh yeah, I forgot one bracket. That's the trouble: I forgot one bracket and things get a little bit funny, but now we are back on track.

Let me just run this one more time. It's running through. Let me now, after that, do a print function and look at list_pages and see what we obtain. This is our result, basically. Let's run this one more time. So what do we get out of it? Here we are: we have one occurrence on page 0, one occurrence on page 2, two occurrences on page 29, and one occurrence on page 30.
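
The steps walked through up to this point (open the document, loop over the pages, extract the text, call findall, append a tuple) can be sketched roughly as follows. This is a minimal sketch and not the code typed in the video. It assumes PyPDF2 2.x or later, where the reader class is PdfReader, the pages are available as doc.pages, and the text comes from extract_text(); the names heard in the video (PdfFileReader, getNumPages, getPage, extractText) are the older spellings of the same steps. The helper names read_page_texts and occurrences_per_page and the sample strings are made up for illustration, so that the matching logic can be tried without a PDF at hand. Page numbers are zero-based indices, so page 0 is the first page of the document.

```python
import re


def occurrences_per_page(page_texts, search):
    """Return a list of (count_page, page_index) tuples for pages with a match."""
    list_pages = []
    for i, text in enumerate(page_texts):
        # findall returns a list of all matches; an empty list is falsy
        matches = re.findall(search, text)
        if matches:
            count_page = len(matches)
            list_pages.append((count_page, i))
    return list_pages


def read_page_texts(file_name):
    """Hypothetical helper: extract the text of every page of a PDF file."""
    # Imported here so the matching logic above also runs without PyPDF2.
    from PyPDF2 import PdfReader  # assumes PyPDF2 2.x or later

    doc = PdfReader(file_name)
    # extract_text() may give an empty result for pages without a text layer
    return [(page.extract_text() or "") for page in doc.pages]


# Made-up page texts standing in for read_page_texts("your_file.pdf")
sample_pages = [
    "The independent auditor signed the report.",
    "No match on this page.",
    "An independent director chairs the independent committee.",
]

print(occurrences_per_page(sample_pages, "independent"))
# prints [(1, 0), (2, 2)]
```

With a real file, the list passed in would come from read_page_texts, and the printed list would have the same shape as the result read out above.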
Now that's pretty cool. It's working as you would expect, so that's quite nice. We now have all the stuff we need. The only thing we could do now as well is think about combining everything into a total word count, if we want, and we might also like to count all the pages on which we find the term. We can of course modify things as we want. So I just do the following: I ask for the number of pages that contain the search term at least once. That is simply a count; I just call this count. Again I use the len function, and I look at the whole list and count how many of these tuples I have. That gives me the number of pages that contain the search term at least once. So I might like to know this.

Then we can of course also do something like the total word count, if this is what we are interested in. Let me just call it total. I'll do it step by step. I think the best is to do a list comprehension here. We know we have a list of tuples, but now we just want the number of occurrences, which is tuple position zero. So I do something like this: tup, for tuple, and as index I use zero, so I look at the first position inside the tuple, which gives me the number of occurrences, and then I run through: for tup in list_pages. I'll explain that in a minute. This is called a list comprehension. I obtain a new list; you see here we have the notation for lists, square brackets. I start reading here: for the tuple inside my list_pages, I put tuple position zero into my new list. So I should get a new list where I only have the occurrences but not the page numbers. If we run this, total should now be a list. If we run this, let's see what
happens, if I didn't make any stupid mistake. Most likely I did make a stupid mistake, but here we are. Okay, so it's running through. Of course, I don't have a print statement, silly me. Let's do a print statement here and run it one more time. And here we are. You see we have our original list of tuples, and then I take only the first element of each tuple, and I get this resulting list. So now we have a normal list, and of course we can use the sum function on that list to get the total number of occurrences. That's the last thing we do: we just add a sum here. Now we have this nice combination of list comprehension and the sum function, and we are all done.

The last thing I would like is a formatted string, so I put an f here. I now give you the outcome: "The word", and then I use a placeholder, curly brackets, with search, which is "independent" in our case. Inside the double quotation marks I can use single quotation marks; that works. You have to be careful: you can't nest double inside double, you have to switch. So I put single quotation marks around the search term. Then "was found", then how many times, which is total, so "total times", "on", then how many pages, which is count, "pages". That is the stuff we return. Let me save that and run this one more time, and that should now give me the output I was looking for. Good: the word 'independent' was found 5 times on 4 pages. Here we are. So now we have all the stuff we need, and a very neat way of looking for things.

Now let's try something else, a very common word: "and". How many times do I get "and"? Let's run this one more time. Doing this manually becomes really painful; it's certainly possible, but quite painful. Here it takes a few seconds, so
here you have all the occurrences: the word 'and' was found 1310 times on 65 pages, so on every page. Of course, once you scale up and have many, many PDF files, this becomes fantastic to use compared to doing it by hand.

The last thing I want to do, and this is only if you're really interested, is refactor it. I want to tidy this up and move it into a function, because I want to reuse it. Right now we do too much stuff manually, too much hard coding, and I want to make it look a little bit nicer. That's the next step, which you might not like to take, but if you want, I'll show you now how to refactor this.

The first thing to note is the arguments. What are the things you need to provide for this to work? We need the file name, you see that here, and we need the search term. The rest should run through. So these are the things I need. Then the question is what you actually want to return. That's a good question, and it is entirely up to you. I would suggest returning either the list, this list_pages, or the count and the total. I leave it up to you what you prefer to return; it depends on your exact problem.

Let me do the following: I move everything here into a function. I mark all of that and use Tab to move it all in, and I start a function here and then tidy up a bit. What do you want to call it? Maybe the word_page_count function; whatever you want to call it. Again I use lower snake case. What I pass in is my file name, which you see here, so I take that and remove it from the body. That's the first thing I pass in. I will call the function later, so I
just move it back down here. So we need a file name in here, which is a string, so I can specify the type. It's sometimes nice to do; it's not necessary. Then we have a search term, which we can again move further down, with Shift and Tab, and the search term is a string as well. We don't specify any defaults; that's not massively relevant here. Let me look at this more carefully, go through it one more time, and think about the logic. Here we add a return statement: what I want to return is total and count.

In the end I can then use this function and run it, so let's do that. Further below I call the function, and we specify a particular file name and a search term. That is fine; that should do it. We could call this result or something, otherwise it gets quite long. Now result contains total as the first element of the tuple, so in the formatted string we put not total but result[0], and then result[1]. Again, if you have any questions on what I'm doing here, let me know. This is just normal refactoring. Here we basically apply the function, so this is the application, and you would usually move it into a different file, but we keep it in here for now, just for testing.

Good. Let me go through this one more time in terms of logic. We define the word_page_count function. It takes file_name, which is a string, and search. It then passes file_name into the PdfFileReader, which gives me the document, which I call doc. Then I get the number of pages, and I create my list_pages, which is a list of tuples that contains all occurrences and the respective page number. I go into the for loop through all the pages, I take each and every page into the current page, I extract the
text, and I use the if statement with findall, which anyway gives an empty list, which counts as false, if the term is not found. If it is found, findall gives me a list of all the occurrences, and the len function gives me the number of times the term appears on one page, which I call count_page, maybe not the best name. Then I append to the list the tuple of all the occurrences and the respective page. I count the number of pages with a match, which is the number of tuples in list_pages, and then I obtain the total word count using the sum function on my list comprehension, and I return a tuple of total and count. So that's the idea, that's the logic. Isn't it beautiful? Let's run this. Most likely I just made a mistake; that's the beauty of it. Let me try. Okay, we are done. Again we see the word 'and' was found 1310 times on 65 pages.

Now the beauty of refactoring is that I can simply work with this and modify it. For instance, I can just override the result and change the search term directly here, but ideally I would do it in the next lines. I do a new search here, something like "director", and just run this and see what happens. So I can modify it, I can run it several times, and I can reuse it. We now have a nice function that provides the total occurrences and the number of pages where occurrences occur: 'director' was found 31 times on 11 pages. Now that's great.

That's all for today. I hope you had some fun. Let me know if you need anything else, and I'll see you next time with more problems around data wrangling. I hope you have some great data fun.
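
The refactored function described in the last part of the video could look roughly like the sketch below. It is an illustration under assumptions, not the author's file: it again assumes PyPDF2 2.x or later (PdfReader, doc.pages, extract_text()) instead of the older PdfFileReader calls named in the video, and it splits the work into a hypothetical helper count_in_texts, which only needs a list of page texts, and word_page_count, which reads the PDF and delegates to the helper. The sample strings are made up. Note that findall treats the search term as a regular expression, is case-sensitive, and also matches inside longer words, so "and" is counted inside "standard" as well.

```python
import re


def count_in_texts(page_texts, search):
    """Hypothetical helper: return (total, count) for a list of page texts."""
    list_pages = []
    for i, text in enumerate(page_texts):
        found = re.findall(search, text)
        if found:  # an empty list means the term is not on this page
            count_page = len(found)
            list_pages.append((count_page, i))
    # number of pages that contain the search term at least once
    count = len(list_pages)
    # list comprehension keeps tuple position zero; sum adds the counts up
    total = sum([tup[0] for tup in list_pages])
    return total, count


def word_page_count(file_name: str, search: str):
    """Read a PDF and return (total occurrences, pages with a match)."""
    # Imported here so count_in_texts also runs without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumes PyPDF2 2.x or later

    doc = PdfReader(file_name)
    page_texts = [(page.extract_text() or "") for page in doc.pages]
    return count_in_texts(page_texts, search)


# Made-up page texts; with a real file, call word_page_count(file_name, search)
search = "director"
sample_pages = [
    "The director resigned.",
    "No match here.",
    "Each director met another director.",
]
result = count_in_texts(sample_pages, search)
print(f"The word '{search}' was found {result[0]} times on {result[1]} pages.")
# prints: The word 'director' was found 3 times on 2 pages.
```

Keeping the matching in a function that takes plain strings is a design choice of this sketch: it makes the counting easy to check on small inputs before running it over many PDF files.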
Info
Channel: YUNIKARN
Views: 43,433
Keywords: Coding, python pdf data extraction, python pdf scraping, python pdf parser, python pdf parser example, python pdf, readerpypdf2, pypdf2, pythonpypdf2, extract text from pdf, python pdfminer, python pdf to text, pdf data extraction python, pdf data extraction, pdf data extractor, python data parser, python pdf editor, pdf text extraction python, pdf text extraction, pdf text extractor, python read pdf file, python read pdf text, python read pdf, python pdf open, programming
Id: y_ORF4FpZYo
Length: 35min 35sec (2135 seconds)
Published: Fri Feb 11 2022