Extract PDF Content with Python

Captions
What is going on, guys, welcome back. In today's video we're going to learn how to extract information from PDF files in Python, so let's get right into it. [Music]

All right. When preparing the code for this video, I wanted to find a solution where I could use just one simple Python library to work with PDF files: to extract tables, images, text and so on. Even though I think it's possible to do all of this with one of these libraries, I figured out that it's easier to use different libraries for the different use cases. If you want to extract images, there's a library that makes it easy for you; if you want to extract text, there's a library for that; and if you want to extract tables, there's a different library again. So in today's video we're going to do three sections, where we use three different Python packages to do three different things with PDF files.

If you're working on a huge PDF parsing project, maybe you should not rely on many libraries but use as few as possible. So maybe you should go deeper into one library and figure out how to do all these things with it, or even develop the process from scratch. But for this video we're going to look at three different packages, because I think a lot of you will be interested in simple approaches: how can I get the images, how can I get the text, how can I get the tables, and how can I work with the data I get from them?

So the first thing we want to do is install the package pdfminer.six: pip install pdfminer.six. This is the library we're going to use to extract and process text from PDF files. For that we're also going to import the regular expression module, re, which we're going to use to find certain patterns of strings, just so we have something to do with the data that we get. I have a video on this module already; it's
part of the core Python stack, so you don't need to install it. Then, from pdfminer.high_level, we're going to import extract_pages and extract_text. I hope you can see this and I'm not blocking it with my camera, but there you go.

I didn't show you this yet: we're going to use this sample PDF file, where we have a simple heading, "NeuralNine Sample PDF File", then some text, then a table and then an image, and we're going to extract the individual elements. So we say: for page_layout in extract_pages of sample.pdf, for element in page_layout, print that element. You're going to see that we get a lot of different elements. We have a text box, another text box and so on; then we have the rectangles, which are the table; and somewhere down here we have the LTFigure, and part of that LTFigure is the LTImage. We're not going to use this library to extract the image, even though it would be possible somehow; we're going to work with a different library for the images. But this is how you extract the individual elements so you can look at them.

So let's get rid of all this and say text equals extract_text, and provide sample.pdf here. Of course we need to print it. This is how you get all the text; it's quite simple, and that's already it. You get all the raw text, also from the table and so on, and now we can apply regular expressions, filter manually, or do whatever we want with the text. This is actually all you need to do to get the text, so you don't even need extract_pages unless you want to go page by page and extract the individual elements; you can just call extract_text on the PDF file. And now we can do stuff like this: the pattern is
going to be re.compile, so we're going to compile a simple pattern, and for this we're going to add an r here. We want to find anything that starts with a lowercase or an uppercase character, at least one of those, followed by exactly one comma and then exactly one space. So this is the regular expression: any letter, uppercase or lowercase, at least one but as many as we want, followed by exactly one comma and exactly one space. Now we're going to look for that pattern in the text: we say matches equals pattern.findall, we look in the text, and then we print the matches. I'm not sure if we're going to see the actual matches. There you go: we can see Mike, Sarah, Bob, John. The purpose here was to get the names, and what we see are letters followed by a comma and a space. In this case Emma will not be found, because Emma doesn't have a comma and a space afterwards, but you can refine the regular expression if you want to. Then we can say names equals n up until the comma, so we exclude the comma and the space, the last two characters, for n in matches. Then we print the names, and we have a list of names, even though Emma is missing with that approach. So we extract the text and use a regular expression to extract individual names, or some of them, and this is how we extract text from a PDF file. This is quite simple.

So next we want to look at how we can extract images from PDF files, and for that we're going to need more libraries. We're going to open up the command line and say pip install PyMuPDF. This is the library we're going to use for that. Of course we're also going to need Pillow, not for the PDF part but because
Pillow is what we often use to work with images, so we're going to say pip install Pillow. Then, surprisingly, fitz is the module we're going to import here; this is what we get by installing PyMuPDF. We're also going to import PIL.Image, which is part of Pillow. So even though the packages are called like this, this is what we import. We're also going to import the core Python module io, the input/output module, because we need BytesIO to work with the image data.

What we do now is say pdf equals fitz.open of sample.pdf, so now we have this PDF object. I want a counter starting at one, because I want to extract all the images. In this case it's simple because I only have one image, but if we have multiple images we can iterate over them and extract them. So I say counter equals one, and to make this scalable: for i in range of len(pdf), which is basically for each page, we say page equals pdf[i]. len(pdf) gives us the number of pages, in this case again only one, but if we have many pages this is going to iterate over them so we get the respective page. Then we say images equals page.get_images, so we get all the images from that page, and we iterate over them: for image in images, the base image we're going to work with is pdf.extract_image, and we pass image at index zero. This is now the base image. Now we're going to get the image data from the base image. Maybe I should show this: if I print the base image, it is not the image data. The base image is a dictionary with certain meta information, and then we have the image itself. Can I find it? Yeah, the image itself is then... now I
cannot scroll. There you go: the image itself is then the bytes of the image, so we need to get the "image" key of that key-value pair to actually get the image data. Then, and maybe I should call this img now because I already have image up here, img is going to be PIL.Image.open, and we pass io.BytesIO of the image data. Then the extension [Music] is determined by whatever the image is: the base image also has a key-value pair for the extension, "ext". Then we save the image with a certain file name. We say img.save, we open a file stream and pass an f-string here, image, counter, dot, whatever the extension is, and this is a writing-bytes stream. Then we increase the counter by one in case we have more than one image. So we run this and, there you go, image1.png, and we would have image two, three, four and so on if we had multiple images on multiple pages.

Last but not least, we're going to look at how to extract tables from PDF files and easily turn them into pandas data frames, so that we can work with them in data science processes. For that we're going to use a library called tabula. We open up a command line and say pip install tabula-py, which is again a library that has a different name when we install it than when we import it, because we import just tabula, not tabula-py. What we do now is say tables equals tabula.read_pdf of sample.pdf. If you want, you can also provide pages here: you can say pages equals something like one, or even something like "all", and "all" basically means all the pages.
In this case it doesn't make a difference, because we have only one page. From that PDF file we now already have the tables, because tabula already picks out the tables: when we read a PDF file, we don't get the different elements, we only get the tables. So I can just print tables, and you can see, first of all, a message that the pages argument isn't specified, so it will extract only from page one by default. We can also say pages equals "all"; it doesn't really matter. Let's run this again. You can see now we have a list of pandas data frames. In this case we only have one data frame, and if I say print tables[0] you can see the actual pandas data frame. I can also say type. By the way, you should also install pandas if you don't have it, so you can go into the command line and say pip install pandas. But you can see here that the type of what we get from tabula is pandas.core.frame.DataFrame, so it's a data frame, which means we can store it: we can say df equals tables[0] and we already have the data frame. We can print it again, and we can use the methods that we can use on ordinary pandas data frames. We can also filter the data by indexing the full data frame with a condition: we want the entries where the age is above 30, for example, so df.age above 30.
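The table steps described in the captions can be sketched roughly as follows. This is a minimal sketch, not the code shown in the video: the helper names read_tables and rows_above are hypothetical, the column name "Age" in the usage comment is an assumption about sample.pdf, and tabula-py additionally needs a Java runtime installed.

```python
def read_tables(pdf_path, pages="all"):
    """Return the tables found in the PDF as a list of pandas DataFrames."""
    # Imported inside the function so this file loads without tabula-py.
    # Install with: pip install tabula-py pandas (tabula-py also needs Java).
    import tabula
    return tabula.read_pdf(pdf_path, pages=pages)


def rows_above(df, column, threshold):
    """Keep only the rows whose value in `column` is greater than threshold."""
    # Boolean indexing: df[column] > threshold builds a True/False mask,
    # and df[mask] selects the rows where the mask is True.
    return df[df[column] > threshold]


# Usage (assumes sample.pdf contains one table with an "Age" column):
#   df = read_tables("sample.pdf")[0]
#   print(type(df))
#   print(rows_above(df, "Age", 30))
```

Passing pages explicitly avoids the warning mentioned in the captions about the pages argument not being specified.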
[Applause] And then we get a limited output. So that's probably the easiest of all three: images was a little bit more complex, getting the text was not really complex, we just had to extract it, but this is very, very easy. You just say read_pdf, you provide a PDF file and you get a list of pandas data frames. This is very convenient, and if you have a PDF file with a lot of tables that you want to process in a data science project or something, this is a very useful library.

So that's it for today's video. I hope you enjoyed it and learned something. If so, let me know by hitting the like button and leaving a comment in the comment section down below, and of course don't forget to subscribe to this channel and hit the notification bell to not miss a single future video for free. Other than that, thank you very much for watching, see you in the next video, and bye. [Music]
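For the text section of the video, the extract_text call and the name pattern can be sketched like this. It is a sketch under assumptions: the function names are hypothetical, and sample_text is a made-up stand-in for the contents of sample.pdf, which is not reproduced in the captions.

```python
import re

# Pattern from the captions: one or more letters (upper or lower case),
# then exactly one comma and exactly one space.
NAME_PATTERN = re.compile(r"[a-zA-Z]+, ")


def extract_names(text):
    """Return every word in text that is directly followed by ', '."""
    matches = NAME_PATTERN.findall(text)
    # Drop the trailing comma and space (the last two characters).
    return [m[:-2] for m in matches]


def names_from_pdf(pdf_path):
    """Extract all raw text from a PDF with pdfminer.six, then find names."""
    # Imported lazily so extract_names() works without pdfminer.six.
    # Install with: pip install pdfminer.six
    from pdfminer.high_level import extract_text
    return extract_names(extract_text(pdf_path))


# Hypothetical stand-in for the text of sample.pdf.
sample_text = "Mike, Sarah, Bob, John, Emma"
print(extract_names(sample_text))  # ['Mike', 'Sarah', 'Bob', 'John']
```

As in the video, the last name is missed because it is not followed by a comma and a space; a refined pattern would be needed to catch it.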
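The image section can be sketched in the same spirit. The helper names image_filename and extract_images are hypothetical. Unlike the video, which opens a file stream in "wb" mode, this sketch passes the file name straight to img.save, which is a simplification rather than the presenter's exact code.

```python
import io


def image_filename(counter, ext):
    """Build the output name used in the captions, e.g. image1.png."""
    return f"image{counter}.{ext}"


def extract_images(pdf_path):
    """Save every embedded image of the PDF; return the file names written."""
    # Third-party imports are kept local so this file loads without them.
    # Install with: pip install PyMuPDF Pillow
    import fitz  # the module installed by PyMuPDF
    import PIL.Image

    saved = []
    counter = 1
    pdf = fitz.open(pdf_path)
    for i in range(len(pdf)):  # len(pdf) is the number of pages
        page = pdf[i]
        for image in page.get_images():
            # image[0] is the image's cross-reference number in the PDF.
            base_image = pdf.extract_image(image[0])
            image_data = base_image["image"]  # raw image bytes
            extension = base_image["ext"]     # e.g. "png"
            img = PIL.Image.open(io.BytesIO(image_data))
            name = image_filename(counter, extension)
            img.save(name)
            saved.append(name)
            counter += 1
    pdf.close()
    return saved
```

Calling extract_images("sample.pdf") on a one-image file like the one in the video would be expected to write a single file in the working directory.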
Info
Channel: NeuralNine
Views: 192,621
Keywords: python, pdf, python pdf, python pdfminer, pdfminer, fitz, PyMuPDF, tabula, python tabula, python fitz, python PyMuPDF, python pdf parser, python parse pdf, python extract pdf content, python extract pdf images, python extract pdf tables
Id: w2r2Bg42UPY
Length: 13min 15sec (795 seconds)
Published: Mon Aug 29 2022