Extract text, links, images, tables from Pdf with Python | PyMuPDF, PyPdf, PdfPlumber tutorial

Video Statistics and Information

Captions
Hello and welcome to this tutorial. I'm going to talk about the best Python libraries for working with PDF files, and by best I mean popular and also easy to use for different purposes. I just published this article on my website, pythonology.eu. You can check it out and subscribe to my newsletter, and once a week I'll send you Python-related articles with all the code snippets. For now I have mentioned pypdf, pdfplumber, and my favorite, PyMuPDF; I've saved the best for last.

You can see an introduction here. Then, why pypdf? If you want some simple extraction of text and images, you can use pypdf, and you can see I've included all the code. Here is the form for subscribing to my newsletter; I really appreciate it if you do. Then pdfplumber: how is that different? Obviously you can extract text, but you can also extract tables, and that's the advantage. Then PyMuPDF: why this one? I love this one because you can not only extract text but also extract the metadata and the table of contents, convert your PDF file into an image, and get the links from a PDF file. You can print all the links that exist in a PDF file.

For this tutorial I'm using this PDF file. It has a couple of internal links to different sections of the text, which is just random stuff, and then a link to pypdf, for example. I have a table here, an image, and another link here, and that's it: just two pages. I'm going to use Google Colab to write my code, and you can see I have also dragged and dropped that PDF file here.

Okay, let's start with pypdf. There are different versions of it, PyPDF2 and PyPDF4, so for now I'm just going to extract the text and images from the file, and this works okay. So: pip install pypdf, with an exclamation mark in front because that's how you do it in Google Colab. Then I have imported PdfReader from pypdf and run it. Now the first thing we need to do is to create a reader object, because we
are going to read a file. And what is the file? We ask the PdfReader that we just imported to read file.pdf; this is our PDF file. So now we have an object, reader, which has access to all the methods of PdfReader. Perfect.

What can I do with it? I can see the number of pages in this PDF, so I can simply print the length of reader.pages, with one more closing parenthesis here. If I run this you can see 2, which means I have two pages. So we add the .pages property to the reader object and take the length of it; that's it.

Now, if I want to grab the first page, I can say page is equal to reader.pages and pass in an index: 0 means the first page and 1 means the second page. So I have access to the first page in our file. What can I do with it? I can extract the text, so I print the page that we just grabbed and use the extract_text method, with parentheses. Let's print it: you can see 2, which refers to the number of pages, and then the text of the first page. Awesome. You can see this was the table that we had, which has not come out correctly; this will be corrected using pdfplumber, as you will see. Okay, so this is how the table was.

All right, so now we have extracted the text from the first page, that is, index 0. How do I do it if I want to go through every page and extract the text of every page? We can simply use a for loop. I can say for i in, and then I would specify a range, because here I need to put in a number, right: 0, 1, 2, 3, 4.
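The pypdf steps up to this point (create a reader, count the pages, read the first page) could be sketched roughly as follows. This is a minimal sketch, not the code from the video: the helper names `open_reader` and `count_and_first_page_text` are made up for illustration, and `file.pdf` is assumed to be in the working directory.

```python
def open_reader(pdf_path):
    """Create a pypdf reader for the PDF at pdf_path."""
    # Imported lazily so the helper below can be tried without pypdf installed.
    from pypdf import PdfReader  # third-party: pip install pypdf

    return PdfReader(pdf_path)


def count_and_first_page_text(reader):
    """Return (number of pages, text of the first page).

    Hypothetical helper: works with any object that exposes `.pages`
    whose items have an `extract_text()` method, as pypdf readers do.
    """
    page_count = len(reader.pages)  # reader.pages behaves like a list of pages
    first_page = reader.pages[0]    # index 0 is the first page
    return page_count, first_page.extract_text()


# Usage, assuming a file named file.pdf exists:
# reader = open_reader("file.pdf")
# print(count_and_first_page_text(reader))
```

Keeping the file opening separate from the page logic is only a convenience here; in a notebook the same calls can be written top to bottom in one cell.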
So I want to know first of all how many pages there are, and that means I want to know the length of reader.pages. This is 2 for my file, so with for i in range(2), i would be 0 and then i would be 1. Perfect, so now I have numbers to fill in here. I'm going to copy this part, and instead of 0 I'm going to pass in i, so page is going to be the first page the first time and the second page the second time. What I print is going to be page.extract_text(). I also have to comment this out so that I don't see this text here. Now I'm going to run this, and let's see the two pages appearing. Perfect, so I have access to the two pages of my PDF, as you can see.

Okay, so that was for the pages. Now what about images? You saw that I have an image here; how can I grab that image? I'm going to use for i in page.images, where i is just the image object. So I go through the images in my page, and page here means the page at index 0, page one. It goes through the images, and what I'm going to do is find the image, write it, and save it here. That's why I'm going to use a with open statement, which says open i.name, so grab the name of that image, and use the mode wb, writing in binary. If you don't know what I'm doing, this is just file handling in Python, and I have a tutorial for that: you open a file to write to it, and you refer to it as whatever you like, for example f for now. Then I say f.write, and I write to that f the data of that image. That is how it works. So again: I go through the images of the page, I open a file with the name of that single image, and I write the data to it, so it will appear here. If I run this, close the panel, and open it again, you can see X 16.
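The page loop and the image-saving loop described above might look something like this. It is a sketch under assumptions: the function names `extract_all_text` and `save_page_images` and the `out_dir` parameter are hypothetical, and the code relies on pypdf image objects exposing `name` and `data`, as used in the video.

```python
from pathlib import Path


def extract_all_text(reader):
    """Return a list with the extracted text of every page."""
    texts = []
    for i in range(len(reader.pages)):  # i runs over 0, 1, ... up to the last page index
        page = reader.pages[i]
        texts.append(page.extract_text())
    return texts


def save_page_images(page, out_dir="."):
    """Write every image found on one page into out_dir; return the saved paths."""
    saved = []
    for image in page.images:
        target = Path(out_dir) / image.name
        # "wb" = write in binary mode, because image data is bytes, not text
        with open(target, "wb") as f:
            f.write(image.data)
        saved.append(target)
    return saved


# Usage, assuming pypdf is installed and file.pdf exists:
# from pypdf import PdfReader
# reader = PdfReader("file.pdf")
# print(extract_all_text(reader))
# print(save_page_images(reader.pages[0]))
```

Iterating directly with `for page in reader.pages` would also work; the index-based loop is shown because that is how the video builds it up.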
So this is the name of it. Now you can see this is it. It is too big, though, but you know what I mean. So that's the image that we just got. Okay, so that is how you can also use pypdf to extract images.

Now what about pdfplumber? I'm going to use pdfplumber specifically for extracting tables. You can do whatever I just did with the text with it, but I'm going to use pdfplumber only for the tables. I have run this already and installed it. What I'm going to do is say with pdfplumber.open: using pdfplumber I open one file, which is file.pdf, and I refer to it as f. Then I go through the pages of this file and find the tables. I would say for i in f.pages, so f means this file and the pages of that file, and for every page in it I print i, which is a single page, dot extract underscore, not text but tables. Yes, it's a method. Now let's print this, and you can see I have two pages. For the first page I have this table, it's like a Python list, and the second page doesn't have any tables, so that's why it's empty. And you can see it's actually here: on the first page we have this table. So pdfplumber is a great library if you want to use it to extract tables. You could also extract text, just as I did before.

But now I'm going to get to the last one, PyMuPDF. The import name is fitz; apparently this has some interesting history, so you can go to the documentation and check why the name fitz was chosen. All right, now that I've installed it and imported it, the first thing you need is to create a document object. I'm just going to call it doc, and it is going to be fitz.open with the name of our PDF file, which is file.pdf. So this is how you create a document object of your PDF. You could
also say fitz.Document and then file.pdf, but this is what I do. Now I have access to the methods of this document. The first one I'm going to use is doc.page_count, just to see how many pages there are, and you can see two pages. Okay, nothing special.

Another thing is to grab the metadata, like the author and all that, so I'm going to print doc.metadata and let's see. You can see we have the format of the PDF; there's a title, which is Untitled; author, no one; subject, nothing; and creator, and you can see here that Google Docs has rendered it as well; and there is no encryption. Perfect, this is amazing. So now we have this metadata.

What else do we need? How do we get the pages? I would say page is equal to doc.load_page with the index 0; that is how I get access to page one. If I want to grab the text of page one, I can simply say print page.get_text, which is a method, so with parentheses, and you can see I have it. The table is not as good as with pdfplumber, though, but yeah, this is simple, nothing special yet.

Now I want to turn this PDF into an image. Let's just close this off, and right here I'm going to create a variable called pix, equal to that page that we just created up there, page one, and I'm going to say dot get underscore pix... do I have any suggestions? Pixmap, yes. So get_pixmap is a method: page.get_pixmap(). Now that we have this, I'm going to save it somewhere, so I say pix.save. Under which name shall we save it? Let's have an f-string here: page underscore, and then, inside curly braces because I'm going to use some Python here, page.number. page.number is another property that gives us the number of that page, so it would be page_1.png, for example, or page_2.
png, something like that. That is how we are going to save that image. Now I'm going to run this and see what happens. Let's close the panel and open it again: page_0.png, so that is our page one. Let's open it, look: now we have this image, an image of our first page. Perfect. What about all the pages? I think you can guess that you can use a for loop to go through them and do the same for all of them; I have mentioned that on my website.

What about the links? We can also grab the links. Let's save them inside a variable, links, which is the page that we just created, dot get_links(), and now let's print links. Look at this: two kinds of links have been recognized. One is an internal link, right here for example, a link that takes you to another part of the document. Another is an external link, like this one. You can see that we have a list of dictionaries: kind 1, kind 1, all the way until we get to this one here, which is kind 2, pythonology.eu, and this is considered kind 2. So now we have the links. If you want to grab the links on all pages, again use a for loop, go through the pages, and grab the links; I have that code on my website as well, so you can go and check it out.

I hope you liked this video. It was very simple, and if you did, please leave a like or a comment down below and check out my website; you're not going to regret it. Thank you very much for watching and listening.
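The pdfplumber table step and the PyMuPDF for loops that the video leaves to the website could be sketched along these lines. This is not the author's code: every function name below is hypothetical, `file.pdf` is assumed to exist, and treating kind 1 as internal and kind 2 as external follows what the printed output in the video shows.

```python
def extract_tables_per_page(pdf_path):
    """Return one list of tables per page, using pdfplumber."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # A page without tables contributes an empty list.
        return [page.extract_tables() for page in pdf.pages]


def page_image_name(page_number):
    """Build the output file name used in the video, e.g. page_0.png."""
    return f"page_{page_number}.png"


def split_links(links):
    """Split PyMuPDF link dicts into internal links and external URLs.

    Assumption taken from the video output: kind 1 marks a link inside
    the document and kind 2 marks a link to an external address.
    """
    internal = [link for link in links if link.get("kind") == 1]
    external = [link.get("uri") for link in links if link.get("kind") == 2]
    return internal, external


def render_pages_and_collect_links(pdf_path):
    """Save every page as a PNG and return (internal, external) links."""
    import fitz  # PyMuPDF, third-party: pip install pymupdf

    doc = fitz.open(pdf_path)
    all_links = []
    for page in doc:  # a document can be iterated page by page
        page.get_pixmap().save(page_image_name(page.number))
        all_links.extend(page.get_links())
    doc.close()
    return split_links(all_links)


# Usage, assuming both libraries are installed and file.pdf exists:
# print(extract_tables_per_page("file.pdf"))
# print(render_pages_and_collect_links("file.pdf"))
```

Note that `page.number` starts at 0, which is why the first page is saved as page_0.png.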
Info
Channel: Pythonology
Views: 99,729
Keywords: python pdf, pdf python, pypdf, pymupdf, pdfplumber, python pdf library
Id: G0PApj7YPBo
Length: 17min 0sec (1020 seconds)
Published: Tue Jan 17 2023