How to Install Tesseract OCR on Windows and use it with Python

Captions
What is going on, everybody. In this video we're going to show how to get Tesseract going on a Windows machine and how you can use Python to interact with it, so let's jump right in. Tesseract is OCR, optical character recognition, which is good for extracting text from things like PNGs and JPEGs and other image formats as well.

So the first thing we're going to do is import pytesseract, but as you can see, it is not installed, so we just need to pip install pytesseract. Not tesseract: there is a package out there on PyPI, but it looks like that hasn't been updated since around 2015. Didn't really look too deeply into it, so not sure what that is.

All right, so now that we have pytesseract there, we can go ahead and read an image in. But to do that we're going to need two pieces here: we're going to need actual Tesseract, and we are going to need an image to read in. So first we're going to do Tesseract download, and if we just go to the docs, we'll go ahead and do UB Mannheim and download the executable.

And I forget the syntax of reading in the text: python tesseract read text from image. It looks like that might be doing some kind of conversion there; that's showing command line, not what I'm looking for. Let's go ahead and kick off the installer, and while that's going we'll see if we can find that command. It's just me on this PC, so I will select just my user. Install that. Don't really need that. See if that's finished. There we go.

Let me do pytesseract. I want to say it's text from image; hopefully that'll give us a better result, and we'll need that command as well. image_to_string, that's what it is. So I'm just going to go ahead and copy that, and we'll put that in the cell below. I'm going to call that variable text, and this is where our image will go.

So now that we have Tesseract installed, we need to tell Python where it can find it, so we just need to add that to PATH. Since I installed it just for my user, I'll go to Users and hidden items. It's probably in AppData. There we
are, and I'm going to copy that. We need to add that to environment variables just so the computer knows where to find it, and I'm just going to do this for my Path. Put that in there, and now if we open up a command prompt, if you choose to do it that way, we should just be able to do tesseract, and you see it's calling the executable in that directory.

Okay, so there is one other piece that we'll need: we need to tell Python where to find that executable. We will need to quote that, and because Python and Windows don't use the same directory separators like that, we just need to flip them the other way. And I believe that's tesseract.exe. And we saw in that post the command that we'll need. There we go: pytesseract.pytesseract.tesseract_cmd. And we'll just spin back here.

All right, so now we need an image. Let's do human services; usually they've got forms and such. Doesn't really matter; going to do mental health, and we can just go ahead and download that. What is this one called? We'll call that... I should have left that the same. I'm just going to call this form one, and the only reason I'm not naming it any words in here is because I don't want you to get confused thinking somehow the file name is tipping that off. And I can go ahead and just put that in Documents, as I should have put this one in Documents as well. So now in Visual Studio Code we have those forms.

Okay, let's go ahead and read that in. So we'll do form1.pdf, and we get a big screeching error here: PDF is not supported. So as I said, Tesseract is going to extract text from images, not PDF files. There are ways you can convert PDFs to image files programmatically with Python as well as other tools, but for this demo we'll just go ahead and open those and take a quick snip. Just going to use Google Chrome and look for the Snipping Tool, and that's good enough. Call this form one, and we close that. Oh, and actually that one's still up, that's convenient; we can just grab a snip, and we'll call that form 2. I need to
fix that name, F2; that should be form one. Okay, so we just need to change that extension to PNG, and we can see that it is reading in the text from that image.

So how can this be useful to us? Well, let's say we're trying to classify a bunch of documents, and we know that documents have certain phrasing printed on them. We could run a tool to convert them into images, or maybe they come in as images, people taking pictures with their camera phones and emailing them in. So what we can do is make a function here. We can see that this one has enrollment, and we can see that this one has responsibility in it. We could do service plan, but that might get into the weeds a little bit; it might insert some characters between the words, so we'd need to look at those. You could scan through all the documents, make a set to make sure all of those are unique, grab a substring of the text that's read in, find similarities, and do it that way. But let's keep going with this example.

We can make a function, classify_doc, that's just going to take in a file name, and then it's going to go ahead and use that to make our text, and here we will do file name. And we'll say if enrollment, because we see that there, and we will want to convert the text to lowercase, just in case it thinks some are capital and some are uppercase. I just normalize it like that. And I forgot my in. We'll say return enrollment. elif, put that on the same level. Oh, usually I forget my semicolons. responsibility in text.lower, return responsibility... in text.lower. Oh, I did not get enough sleep. Okay, else return unknown doc type.

And then we can make that function and we can just run it on either one of these. I honestly forget which is which, but that is enrollment, that is... oh, unknown. Something tells me I misspelled responsibility. Okay, there was a typo in there.

And then if we just go ahead and, actually, we just do CNN and we'll just grab a snip here. If you're kind of new to programming in general, do not think this would be the way to get website data. You would want to do web scraping, use something like Beautiful Soup, actually parse the HTML data to actually get the words from the website, things like that. Maybe if you were doing bug bounties, penetration testing, and, I can't recall the name of the tool, FireEye I think it is, where you give it a list of URLs and it returns a bunch of screenshots of all the websites, that might be a time where you'd want to use a tool like this. So you know there's certain versions of web servers that are vulnerable that have certain text that displays on the default web page; you could definitely use this to sift through those and see if any known-vulnerable default web servers were running. But that's enough of that tangent.

And now we can go back to our code, and that is an unknown doc type. Actually, just out of curiosity I want to see what's in that text. So we could have put World Politics in there and said, hey, that's CNN, and so on.

Okay, so where would we go from here? Let's go ahead and make some directories. We're just going to make an enrollment, we'll make a responsibility, we will make an unknown, and you know what, let's go ahead and make an error directory. I really should have made a directory specifically for this project; sorry, the default Windows folders are kind of muddying the waters over here. So what we can do is go ahead and use shutil, which is a package that allows us to
move around files and interact with the file system in just a nice kind of way. So what we could do is, if enrollment is in there, we want to shutil.move. What do we want to move? The file name that was given. And we want to move that to, we'll use an f-string and just do the current directory, into enrollment, and we want to keep the file name, so we'll just do file name. And we can just copy that, move that down, but that will be responsibility, and otherwise we can go to the unknown directory.

So, actually, we can make a list and just type out the names of these files, but we could use the os package, which is a built-in package in Python, and we can just import that. I believe the syntax is for root, file in os.walk, and dot, dot being the current directory. So if I was in another directory and I wanted to search a directory where I'm not writing my code, I would do the C:\Users, or say it's a network drive, say a G drive that's mapped. But in our case it's all right here. And actually I probably want to do files, and then for file in files, because I believe files returns a list, and I'm just going to go ahead and confirm that by printing out file. Okay, and with this I could say if, let's see, PNG in file.lower, print file, and there's our PNG files. So I could say, okay, if it's a PNG file I'm going to go ahead and run this function on it. And if we look now, no errors: unknown is indeed our CNN, responsibility was found in form two.

And the reason I put errors in there... we can move these back just to give us our reset. You know what, I'm going to click that so it quits asking me. If I go ahead and make a new file, error.png. With image files, I mean, it's all code, so at the beginning typically there's hexadecimal, and whether you have the extension or not it tips you off to what kind of file it is. So you can, trying to think of how to phrase that, you can kind of move a slider if you know what hex codes to look for, whether they be eight hex codes long or four hex codes long. And that's kind of computer forensics: when people delete files, there's no pointer in the file system to those images anymore, but there are tools out there that let you go through the hard disk where there's no pointer to it saying, hey, there's a file here, but then if you see that specific series of hex codes, you can be pretty sure that that's an image file. Read till the end of the hex code; sometimes it's overwritten. But anyways, I expect this to fail, and that's what I want to demonstrate here.

So let's go ahead and... not really sure why I got that error, but let's go ahead and we are just going to run it on the one error file. And actually I'm going to make a file list, modularize this a little bit, and I'm just going to say file_list.append, and what do I want to append? The file. So then if we look at file list, okay, and if we just do error.png, okay.

So how do we handle when an image comes in and it's not really an image? Well, we can do a try and except. We will do try, and rather than crashing in a horribly ugly error code we'll go ahead and catch that, and we can do except Exception as e. And you know what, we will just move that to the error directory. What do we want to move? The file name we gave it. Where do we want to move it? Lowercase f, we want to move it to the errors, and we will keep the file name. And you could improve on this a little bit: you could have maybe an error log file, let you just look at a glance at what files errored out so you can look and then deal with them as needed.

And now if we go ahead and run this... well, we shouldn't have gotten that. Oh, I probably need to rerun that to update the function. text... it's not associated with a value. Oh, it did move it, but then it continued with our code. So what happened was, it tried it, it hit the exception, it did what it was supposed to do here, and it moved the file. But the reason we got the exception was it wasn't able to read the text, and then it went down here and said, okay, for text, and wait a second, I don't have a variable named text, what are you doing? So if we try that again now, all right, no error. We do get a negative one, just kind of a hint for us saying, hey, something went wrong and it was moved to the errors directory.

And these return codes, depending on the size of the code you're writing, can be very useful. So say I've got 500 lines of code and I make this one 55. Well, if I'm running this program and I see, hey, it returned 55, I can go into the code and say, hey, this is the part of my code that returns 55, something is triggering it, what's going on here? But I'll just leave that at -1 for now. Make sure that that is up to date. We have
that piece there, our file list should still be the same, and then I could just do for file in file list, and what do I want to do? I want to classify the doc. And there we go: responsibility, unknown, enrollment, and our error.

Okay, I think that's probably almost enough. There is one other thing that isn't necessarily having to do with this video, but I'll just throw it in anyway. Because we're operating on a list and using a function, you can use the multiprocessing module. You can use pooling to use your CPU cores, run those in parallel, and speed it up. You really won't see that with this, but imagine you have thousands, tens of thousands, hundreds of thousands of files, maybe even over a million. In that case you might want to look into distributed computing, where you're breaking up the data, passing it to different machines out there, and having each one of those workers work on a piece. But until you reach the limits of your local machine, you might want to be familiar with multiprocessing.

And I also want from multiprocessing import Pool. Let's see, cpu_count equals multiprocessing.cpu_count. That'll work; one's a function, one's a variable. Run that, and let's see how many cores I have.

Okay, one other thing I should mention: if you're going to be processing a lot of data on your local machine and you still want to use your machine while you are doing that, do not use what I just showed, because that's going to use every core on your CPU, which will make your machine very slow, possibly inoperable, and may lead to lots of unpredictable behavior. I know from experience. So you could do minus-equals 2, give yourself two cores, you deserve it, and if we look now, that's four. And we can do with Pool, and you can put an integer in here, 4, 3, anything less than your CPU core count, but I made that variable so we can pass it in. And what are we going to do? We are actually going to map a
function. It takes the function name, so you don't need the parentheses; we're not going to call it. It's smart enough to know my first argument is going to be a function, and in this case it's going to be classify_doc. And what do we want to iterate over? Our list, which is file list. And I can already tell you right now, this is going to complete very quickly and not run, or it's going to just run forever.

So what happened there? The answer is I really don't know, but I do know there are issues with what's called the GIL, the global interpreter lock, when you're on Windows. So if you're doing this, you can do it on Windows, but it does need to be underneath the main portion, that if __name__ == "__main__"; it needs to know it's running as the main program. I believe we can fix that by making a classify_doc Python file.

And what do we want? We'll need that import. Just grabbing all my imports here. Best practice is to keep those at the top, and just two more. And just as best practice, you usually do Python built-in packages first, multiprocessing is built in, and then any third-party packages, ones that you have downloaded, and then anything after that, if you make your own custom packages, those would go next. Shortest to longest, and then divided into those three sections. I don't know where I picked that up; it was years ago, not really sure if it still applies, but I try to stick with that.

So we are going to make our function here, and then we are going to go ahead and make our file list, and down here we just need to do if __name__ == "__main__". If you're not familiar with this, you can just kind of wave your hand at it. It's very good to know, but the important thing I want you to keep in mind is, if you're using the multiprocessing pool, you want to put that with Pool underneath that. And fix the indentation; these actually need to be moved over. And we make this a function, get_files, and after it's
yeah, I'll go ahead and make that list still. I was thinking about doing a list comprehension, but you know what, let's just return file_list. And we can just say files equals get_files, cpu_count equals multiprocessing.cpu_count. Now that doesn't look the prettiest, but that'll probably work. All right, and here we have our classify_doc program. Let's see if that runs. No. It's because I've got these here. Okay, I see what I did there: I need to actually use the pool and use the map method from pool, so that was my bad. All right, so now we see those have moved.

We can go ahead and reset those here real quick. If you were actually testing this with a lot of files, you could modify something similar to this, I don't know, call it reset, go one directory deep and say, hey, find me all PNGs, JPEGs, TIFF files, things like that, and move them back to the current directory.

I just wanted to show also that if you go ahead and run that here... oh, let me go ahead and rerun these cells, I was doing a quick bit of troubleshooting there. Make sure I've got my file list. Let's see, and it doesn't really matter, because it's going to be a short run or it will just run forever. I believe what's happening is it's hitting locks, or at least it thinks it is, and it's backing off, and then it's trying to run another one, and it's just getting confused in kind of a traffic jam that way. And that's a good demonstration of why, when you're on Windows at least... and let me go ahead and let this crash to a fiery death. If we go ahead and move that down to the bottom here... ah, we can leave that there, that's fine, it'll run just fine.

But again, that's all for this video. Hopefully this gave you a nice introduction to how you can set up Tesseract on Windows (as always, I highly recommend people use Linux; it's much easier, not as much hassle at all), how you can OCR documents and extract the printed text from them, and maybe it gets the idea juices flowing a little bit: if you want to make a data engineering or data science project, you could use things like this, as well as thinking about scale, how you can divide your problem into units of work, and how you can speed up processing by doing so. But that's it for this video, and I will see you in another one.
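
The setup steps described in the captions (point pytesseract at the installed executable, then call image_to_string on a PNG) can be sketched roughly as follows. This is a minimal sketch, not the code shown on screen: the install path is a hypothetical placeholder, and to_forward_slashes and read_text are illustrative helper names. It assumes both the pytesseract package and Tesseract itself are installed.

```python
from pathlib import PureWindowsPath

# Hypothetical per-user install location; substitute the folder you added to PATH.
TESSERACT_EXE = r"C:\Users\me\AppData\Local\Programs\Tesseract-OCR\tesseract.exe"


def to_forward_slashes(windows_path: str) -> str:
    """Flip backslashes to forward slashes, which Python accepts on Windows."""
    return PureWindowsPath(windows_path).as_posix()


def read_text(image_path: str, tesseract_exe: str = TESSERACT_EXE) -> str:
    """OCR one image file and return whatever text Tesseract finds in it."""
    # Imported here so the path helper above works without pytesseract installed.
    import pytesseract

    # Tell pytesseract where the Tesseract executable lives.
    pytesseract.pytesseract.tesseract_cmd = to_forward_slashes(tesseract_exe)
    # image_to_string accepts an image file path (or a PIL image), not a PDF.
    return pytesseract.image_to_string(image_path)


if __name__ == "__main__":
    # Only shows the converted path; calling read_text needs a real image file.
    print(to_forward_slashes(TESSERACT_EXE))
```

A raw string (the r prefix) is another way to keep the backslashes; either way the goal is that Python does not read them as escape characters.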
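
The classify-and-move function built up in the captions, including the fix where a failed read must not fall through to the keyword checks, might look something like this. It is a sketch under assumptions: the OCR call is passed in as a read_text argument (a hypothetical hook standing in for pytesseract.image_to_string) so the example runs without Tesseract, and the folder names follow the ones created in the video.

```python
import shutil
import tempfile
from pathlib import Path

KEYWORDS = ("enrollment", "responsibility")  # phrases seen on the two demo forms


def classify_text(text: str) -> str:
    """Return the first keyword found in the OCR text, else 'unknown'."""
    lowered = text.lower()  # normalize case before matching
    for keyword in KEYWORDS:
        if keyword in lowered:
            return keyword
    return "unknown"


def classify_doc(file_name, read_text, base_dir="."):
    """Read one file, move it into a folder named after its type, return the type.

    read_text is any callable mapping a file path to text (hypothetical hook).
    Returns -1 when the file cannot be read, after moving it to 'errors'.
    """
    source = Path(file_name)
    try:
        doc_type = classify_text(read_text(str(source)))
    except Exception:
        # Not really an image: skip the keyword checks entirely.
        doc_type = "errors"
    target_dir = Path(base_dir) / doc_type
    target_dir.mkdir(exist_ok=True)
    shutil.move(str(source), str(target_dir / source.name))
    return -1 if doc_type == "errors" else doc_type


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        sample = Path(folder) / "form1.png"
        sample.write_bytes(b"placeholder")  # not a real image; the fake reader ignores it
        print(classify_doc(sample, lambda path: "Enrollment Form", base_dir=folder))
```

Deciding the destination inside the try/except and moving the file once afterwards avoids the bug hit in the video, where the code carried on to use text after the read had already failed.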
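
The multiprocessing step at the end of the captions can be sketched as a standalone script along these lines. Assumptions are labeled in the comments: classify_text here works on made-up text samples instead of running OCR, and usable_workers is an illustrative helper for the "leave two cores free" idea. As far as the Python documentation describes it, the hang seen in the notebook comes from how worker processes are started on Windows (each one re-imports the main module, so the pool has to sit under the main guard in a .py file) rather than from the GIL.

```python
import multiprocessing
from multiprocessing import Pool


def usable_workers(total_cores: int, reserve: int = 2) -> int:
    """Leave `reserve` cores free for the rest of the machine, keeping at least one."""
    return max(1, total_cores - reserve)


def classify_text(text: str) -> str:
    """Stand-in for classify_doc: keyword match on text that is already read."""
    lowered = text.lower()
    if "enrollment" in lowered:
        return "enrollment"
    if "responsibility" in lowered:
        return "responsibility"
    return "unknown"


def main() -> None:
    # Made-up samples standing in for the OCR output of each file.
    samples = ["ENROLLMENT form", "Statement of Responsibility", "World Politics"]
    workers = usable_workers(multiprocessing.cpu_count())
    with Pool(workers) as pool:
        # Pass the function itself (no parentheses); map calls it once per item.
        results = pool.map(classify_text, samples)
    print(results)


if __name__ == "__main__":
    # Worker processes import this file; the guard stops them from starting pools of their own.
    main()
```

Run it as a script (for example python classify_doc.py) rather than from a notebook cell, which is the same move made in the video.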
Info
Channel: Data Slinger
Views: 5,838
Id: GMMZAddRxs8
Length: 37min 6sec (2226 seconds)
Published: Fri Nov 10 2023