Working with PDF files in Python | How to extract text from Pdf using Python?

Captions
Guys, in today's video I will tell you how to extract text from a PDF with the help of Python. Here is the PDF I will use: it is a simple book called ProGet, but you can download any PDF. Through this video I also want to show you that to use any Python module, you do not need to have the functions of that module memorized.
I am searching for PyPDF2, a very famous module used for PDF extraction. Even if you don't know anything about it, you can land on its documentation page by searching for PyPDF2. I will install this module first. I will create a file main.py and open a terminal, and I am guessing pip install pypdf2 will install it. And yes, it is installed. The installation process is also written somewhere in the documentation, but I don't care about that at this point because I just want to use the module.
The first thing I want to do is extract the text somehow, but for that I have to bring this PDF into my Python program. If I don't know anything at this point, what will I do? First of all I will open the PdfFileReader class, because I am getting the vibe from "reader" that this class will help me with reading. What are its parameters? Its important and necessary parameter is stream: it is a file object, and it can also be a string that is a path to a PDF file. So let's import it. I looked for an example below, but none is given, so I will write import PyPDF2, and after this I will write PyPDF2.PdfFileReader and give it progid.pdf. Let me run it. It is running perfectly.
Now, sometimes in VS Code you install a module and it tells you the import is unresolved. In that case, restart VS Code. I don't know why it happens, but when you restart VS Code the problem goes away.
If I write a = PyPDF2.PdfFileReader with the PDF's path, you can see it gives me a PdfFileReader object. Now it has some methods listed in the documentation, like documentInfo, a read-only property that accesses the getDocumentInfo function. Let me see what it gives: I will write a.documentInfo and run it. See, everything associated with this PDF has come up: the title (ProGet), the author, the creator, the producer, and the modification date. This way I can get the info of the document.
Now let's see the other functions. There is no need to get confused. Do you think a human being will remember all these functions? A human being can search on Google and arrive on this page. I am telling you this because it is a myth, a big myth, that you should remember all the functions. Whether it is the Beautiful Soup module, the requests module, or any other Python module, as a Python programmer you don't need to remember them. I could run all these functions one by one, but I want you to run them yourself.
I am interested in getNumPages because I want to know how many pages there are. See, I am printing it: 517 pages. Let's see whether that is true: yes, it says 1 of 517. Now, to read each of the 517 pages there is a function getPage, which retrieves a page by number from this PDF file. Let's run getPage. I am interested in page number 2, so I will write getPage(2). Let's print this and see. It gives me the page as an object: it says IndirectObject, Type Page, Parent, MediaBox, and so on.
So I can get many kinds of objects in Python, but I am interested in the text of one page. How can I extract the text inside it? I am searching the documentation for a function. Let me search for "text". OK, I didn't find anything. Let me look at the PageObject class and see if I get lucky. I have already created a page object, so can I run its extractText? Let me run .extractText() and see. I run it, and its text got extracted. It didn't come out that well, but I extracted the text. I could extract the whole text of this PDF, or let's do one thing: I will take pages 1 to 10, extract their text, and put it in a text file. So how will I do it?
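The steps walked through so far (open the PDF, read its document info, count the pages, pull the text of one page) can be sketched as below. This is a minimal sketch, assuming the legacy camelCase PyPDF2 API shown in the video; PyPDF2 3.x removed these names in favor of PdfReader, metadata, len(reader.pages), reader.pages[n], and extract_text(). The helper names open_reader and inspect_pdf and the file name sample.pdf are hypothetical and do not come from the video.

```python
def open_reader(path):
    """Open a PDF with the legacy PyPDF2 API used in the video.

    Assumption: an older PyPDF2 release (1.x or 2.x) is installed.
    The import is done lazily so the helper below can be read and
    tested without the third-party package.
    """
    import PyPDF2  # third-party: pip install PyPDF2

    # The stream argument may be a file object or a path string.
    return PyPDF2.PdfFileReader(path)


def inspect_pdf(reader, page_index):
    """Return (document info, page count, text of one page).

    Works with any object that exposes the legacy reader interface:
    documentInfo, getNumPages(), and getPage(n).extractText().
    """
    info = reader.documentInfo         # title, author, creator, ...
    num_pages = reader.getNumPages()   # total number of pages
    page = reader.getPage(page_index)  # page numbers begin at zero
    text = page.extractText()          # text order is not guaranteed
    return info, num_pages, text
```

With an older PyPDF2 installed, inspect_pdf(open_reader("sample.pdf"), 2) would return the three values that the video prints one at a time. Because getPage counts from zero, index 2 is the third page of the file.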
OK, let's see. I will comment out both of these lines; I only wrote them to show you how to use the functionality of this module. What you have to do is write for i in range, and I need 1 to 10, so I will write range(1, 11). This loop will run from 1 to 10. Before the loop I will initialize a blank string, str. Inside the loop I want the line of code I wrote earlier, like this: str += a.getPage(i).extractText(). So for each i, it extracts the text of that page and stores it in str. Then I will write this text to a file: with open text.txt in write mode as f, I will write f.write(str).
Now I will run it. I think there is an error here. What is it? It says the 'charmap' codec can't encode a character at some position. OK, so here I have to specify an encoding, and I will specify utf-8. Now my work is done. It ran, and see, text.txt has been created. I close this, and here you can see the file in text form: it has dumped all the text of the initial pages. I did it for 1 to 10; I could have done it for 1 to 30 the same way, and then the file would have the content of 30 pages. So whenever you want to convert a PDF to a text file, you can use this method.
So this is the video I made. I wanted to show you guys how to use a new module, because new modules will keep coming and you can't keep all the functions of a module in your head. This is a fact forever; remember it. You have to develop the skill of searching on Google, reaching the documentation page, and using the module from there.
There are many services online for things like merging PDFs; see ilovepdf.com, for example. All the products that people have built as websites use this kind of module in the backend. These are the kinds of functions they use. If not this module, then some other one, or they might be using some other programming language, but in most cases they use this.
You can use the PdfFileWriter class to write a file; you can explore all those things. Through this video I just wanted to tell you how you can use this module. Try different functions and see what is given here: with getContents you will get the page's content, and with extractText you will get all the text. See what is written here: "Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function." So it is saying that this only converts to text and you should not rely on the order, because there are other functions that tell you what is where in the PDF. You can use all those things.
I hope this program will be helpful for you and that you have learned how to read a PDF. I want you to explore at least one function and comment below to tell me what the function is and what it does. That's it for this video. If you want more videos on Python programming, I have this playlist, and I will add more programs to it. Thank you so much for watching this video, and I will see you next time.
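The loop-and-write step from the video (collect the text of a range of pages, then save it to a text file with an explicit encoding) can be sketched as follows. This is a rough sketch under the same assumption as the video, namely a reader object with the legacy getPage(n).extractText() interface from older PyPDF2 releases. The function names collect_text and save_text are hypothetical, and the sketch gathers the pieces in a list and joins them instead of using the video's str += accumulation, which also avoids shadowing Python's built-in str.

```python
def collect_text(reader, start, stop):
    """Concatenate the text of page indices start .. stop - 1.

    Assumption: reader exposes the legacy getPage(n).extractText()
    interface. Page numbers begin at zero, so range(1, 11) as used
    in the video covers the second through eleventh pages.
    """
    parts = []
    for i in range(start, stop):
        parts.append(reader.getPage(i).extractText())
    return "".join(parts)


def save_text(text, path):
    """Write text to path using UTF-8.

    Without an explicit encoding, open() uses the platform default,
    which on some Windows setups is a 'charmap' codec that cannot
    encode every character and raises UnicodeEncodeError.
    """
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
```

With the reader from the video, save_text(collect_text(a, 1, 11), "text.txt") would reproduce the run shown; widening the range to (1, 31) would cover 30 pages instead.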
Info
Channel: CodeWithHarry
Views: 121,051
Id: GxWwBp8SNNA
Length: 11min 33sec (693 seconds)
Published: Sat Sep 12 2020