Extract Text from any PDF File in Python 3.10 Tutorial

Captions
How's it going, guys? In today's lesson we're going to be looking at how we can extract text from PDF files and print it in the console, so we can later process that text and use it however we please. It's going to be compatible with any kind of PDF file. For example, I have this one that is just a coffee PDF, with lots of data about coffee, and I also have another one, a sample PDF, which is a bit more simple so that we can actually see what we're extracting.

When we go ahead and run this program, we're going to start with sample.pdf, which is the easy one, and we're going to get all the text printed to the console. As you can see, we have all of it inside here, so we can process it later in any way we please. We can also do it with the coffee PDF, and if we run that we're going to get a much longer output with all the PDF information. Again, once you extract this information you can use it in any way you need to, so it can be really good for making some quick searches in PDF files and for making it a lot easier to use regex. That's what we're going to be building in this lesson.

Go ahead and create a new empty Python project. I'm just going to create a new file here. The only thing you need to do ahead of time is find a PDF file that you want to test, such as a sample PDF or a coffee PDF or any PDF you want. You just need to make sure that it contains actual text and not an image of text. This is not image recognition, it is just text extraction. So find some PDF files that you want to extract the text from and place them inside your Python project.

Next you want to open up the terminal and type in pip install PyPDF2. Here we're going to import PyPDF2 just the way it is, and we're going to create a function called extract_text_from_pdf, which will take a pdf_file of type string and return to us a list of strings.

The first thing we need to do is open the file, so: with open pdf_file, and we want to read this as bytes, so we add the "rb" string, and we want to open it as pdf. Now the reader is going to equal PyPDF2, and we need the PdfFileReader, which is going to take the pdf, and we're going to set strict to False. Next we need to create a list, pdf_text, for the text that we're going to extract from this PDF file. For each page in reader.pages we're going to extract the text, so content is going to equal page.extract_text, and we want to call pdf_text.append and append the content. Then we just return pdf_text, and that's all we need to do to extract the information from the PDF file and add it to a list so we can use it later.

Then we're going to create a main check, and inside here we're going to get the text, so extracted_text is going to equal extract_text_from_pdf, and inside here you can insert your PDF file. I'm just going to be using the sample PDF from earlier. For each text in this list we're going to print that text. So now if we click on run, we're going to be able to extract the text from the file, and this is going to work for the coffee PDF as well. As you can see, we have all the text from the coffee PDF.

As I mentioned earlier, you can add whatever kind of processing you want. I definitely recommend you do something such as use regex, and we actually have to import regex for that. With this line of code we're just going to recognize words as they are, without punctuation, so if it says "hello" with an exclamation mark, it's going to remove the exclamation mark and we just get the word "hello" back. With this being done, we can make some very simple checks, such as: if the word "coffee" is in the split_message list, then coffee_count plus-equals one. We need to create that above as well, so coffee_count is going to equal zero initially, and at the end of this we can print something such as "coffee found" and insert the coffee count. Now when we run this, it's going to check all the coffee instances in the PDF, and it finds it 151 times in the coffee PDF file.

So you can do some simple checks like that, and you can aggregate the data that you want. All you need to do is create some sort of regex, and that will make it a lot easier to process the text from the PDF file. Otherwise, going back to the original, all you have to do is insert the PDF of your choice, and that's going to return to you a list of all the pages that were inside the PDF as text. So it was that simple to extract text from PDF files in Python. As always, guys, I hope this tutorial helped. Otherwise, I'll see you guys in the next lesson.
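The steps described in the captions can be sketched roughly as follows. This is a reconstruction from the spoken description, not the code shown on screen, and it makes a few assumptions. It uses `PyPDF2.PdfReader` rather than the `PdfFileReader` named in the video, because newer PyPDF2 releases deprecate the older name. The helpers `split_words` and `count_word` are hypothetical names, the `\w+` pattern and the lowercasing are my own choices for stripping punctuation, and the counter tallies every occurrence of the word, which may differ from the exact check used in the video.

```python
import re


def extract_text_from_pdf(pdf_file: str) -> list[str]:
    """Return the text of each page of the PDF as a list of strings."""
    # Imported inside the function so the regex helpers below can be used
    # (and tested) even where PyPDF2 is not installed.
    import PyPDF2

    # Open the file in binary read mode, as described in the captions.
    with open(pdf_file, "rb") as pdf:
        # Assumption: PdfReader stands in for the video's PdfFileReader.
        reader = PyPDF2.PdfReader(pdf, strict=False)
        pdf_text = []
        for page in reader.pages:
            # Fall back to an empty string if a page yields no text.
            content = page.extract_text() or ""
            pdf_text.append(content)
        return pdf_text


def split_words(text: str) -> list[str]:
    """Split text into lowercase words, dropping punctuation (hypothetical helper)."""
    return re.findall(r"\w+", text.lower())


def count_word(pages: list[str], word: str) -> int:
    """Count occurrences of a word across all pages (hypothetical helper)."""
    target = word.lower()
    return sum(split_words(page).count(target) for page in pages)
```

With a text-based PDF in the project folder, something like `count_word(extract_text_from_pdf("sample.pdf"), "coffee")` would give the total, where `sample.pdf` is only a placeholder file name. Scanned PDFs that contain images of text will yield little or nothing, since this is extraction and not image recognition.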
Info
Channel: Indently
Views: 36,556
Keywords: code palace, cde palace, code palce, palace code
Id: RULkvM7AdzY
Length: 5min 18sec (318 seconds)
Published: Mon Aug 08 2022