Extract Text From Images in Python (OCR)

Captions
What is going on guys, welcome back. In today's video we're going to talk about OCR, which stands for optical character recognition, and we're going to use Python in combination with Tesseract to extract text from images. We're going to take images that contain some text on the side, maybe on street signs, and images that contain mainly text, and then we're going to use OCR to extract the text from those images.

The first thing we need to do is install Tesseract. It's an open-source OCR engine and you can find it at github.com/tesseract-ocr/tesseract. There you can scroll down to "Installing Tesseract" and follow "Install Tesseract via pre-built binary package". If you're on Windows, click on Windows and you'll find "Tesseract at UB Mannheim", which gives you a Windows installer; just download the 64-bit version, for example, and run it. I'm not going to do that right now because I already have Tesseract installed.

The important thing is that when you install it, you look at the directory it gets installed into and add that directory to your PATH. In my case the location is C:\Program Files\Tesseract-OCR. So go to "Edit the system environment variables", click on "Environment Variables", select the Path variable, edit it, and add a new entry with C:\Program Files\Tesseract-OCR. You also want to add a variable called TESSDATA_PREFIX pointing to the tessdata directory: inside the Tesseract-OCR directory there is a tessdata directory, and that is where the languages live. By default I think you only get English; you can download additional trained data files from the GitHub repository (for example, I have the German one) if you're interested in having multiple languages, but in order for Tesseract to find them you need this variable.

That's basically the Tesseract installation. What you also need, of course, is pytesseract, because Tesseract itself is not a Python module; Tesseract is the engine, and if you want to use it from Python you have to install pytesseract. So open up the command line and run pip install pytesseract. I think you can also install Tesseract itself directly via pip, but for some reason that doesn't work as well for me, so I would recommend installing it from the installer, adding it to PATH, and then just using pytesseract.
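As a rough sketch of how that setup maps to Python code (the paths are the ones from the installation described above, so adjust them to yours; the explicit tesseract_cmd line is only needed if the binary is not already on your PATH, and the get_languages call is just an extra sanity check, not something from the video):

```python
import os
import pytesseract

# Only needed if tesseract.exe is not on your PATH; adjust the path to your install.
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Tell Tesseract where the *.traineddata language files live.
os.environ["TESSDATA_PREFIX"] = r"C:\Program Files\Tesseract-OCR\tessdata"

# Quick sanity check: list the installed languages (e.g. ['eng', 'deu', 'osd']).
print(pytesseract.get_languages(config=""))
```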
Now, before we go into any of the Python code, we're going to do everything manually with Tesseract in the command line, so that you see how this basically works. For this I have a bunch of images. For example, here I have a basic image containing some text; it's just an image with basic text, but of course we cannot select it, we cannot just copy and paste it as text, because it is still part of an image. Extracting it is a very easy task, and we can use Tesseract for this. We open up a command line and navigate to the directory where this file is — in my case I'm interested in this text.jpg file. How do we do this? Very simple: tesseract, then the image, so text.jpg, and then stdout to print the result of the OCR extraction onto the command line. By doing that, you can see I get the text as a result. This is a very simple case, because the image is quite straightforward and the text is very readable.

So let's take something that is not as easy, like the logos. Here we have a bunch of different words — we can see Sony, PayPal, Microsoft — and some of them should be recognized. To see if that works, we do the same thing applied to logos.jpg, and this is going to be a little bit more messy. As you see, we get PayPal, Sony, Microsoft, Bosch and so on, but then a lot of stuff that doesn't really make sense; we can also see NVIDIA in there. So it partly works and partly not, and this is the challenge of OCR.

What we also have are certain settings that we can use, and for this I'm going to copy and paste something I have prepared. This is just information: the so-called page segmentation modes, which we're going to talk about briefly in a second, and the so-called OCR engine modes that we can choose. Let me add a comment character at the top so that it shows up in green. So here we have the page segmentation modes and here the OCR engine modes, and we can combine these settings to get different results. How do we do that? We call tesseract and then specify the individual settings. I can do the same thing again and now say, at the end, --psm, which stands for page segmentation mode, followed by a number from this list. I'm not even sure what the default is, to be honest, but you can see what the modes do: 0 says "orientation and script detection (OSD) only"; then we have, for example, "assume a single uniform block of text", which might be useful for an image like the text one, but it doesn't make sense for the logos, because that is not a uniform block of text; we also have the same thing for vertically aligned text. Basically we're choosing a setting to give the engine a little hint — for example, that the image contains just one uniform block of text — and that makes it easier for the process to produce good results. Then we also have fully automatic page segmentation, and "sparse text: find as much text as possible in no particular order". Maybe we get better results with that, so let's apply it to the logos again with --psm 11, and you can see we get a little bit more: we can see Vodafone now, which I don't think we had before, so it recognized something new. But sometimes it also finds stuff that's not there, so that is of course the trade-off. We're not going to go through all the settings, but we can choose these page segmentation modes to get better results if we know something about the image — for example, if we know there's just one single word, there is a mode for that as well. We can try it on the logos to see how it fails: there you go, we got "seats", which doesn't make a lot of sense, but that's what you get from the different modes.
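Those command-line experiments can also be mirrored from Python, for example via subprocess; this is just a sketch of the same calls, not something done in the video (it assumes tesseract is on your PATH and that text.jpg and logos.jpg are in the working directory):

```python
import subprocess

# Plain extraction: print the OCR result for text.jpg to the console.
subprocess.run(["tesseract", "text.jpg", "stdout"], check=True)

# Same thing with an explicit page segmentation mode:
# --psm 11 = "sparse text: find as much text as possible in no particular order".
result = subprocess.run(
    ["tesseract", "logos.jpg", "stdout", "--psm", "11"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```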
Then we also have the OCR engine modes, where we basically decide which model is used: for example neural nets or the legacy engine. I'm not even sure what exactly the legacy engine is, but LSTM stands for long short-term memory, basically a recurrent neural network. You can choose the type of engine you want to use and combine it with the page segmentation mode (or use it without one) by saying --oem, for OCR engine mode, followed by a number from 0 to 3. For example we can go with 3; with 2 I get "couldn't load any languages", so that failed for some reason. This also brings me to the next point, by the way: we can specify the language. I can say -l and choose deu for German, and it will try to find German words, but it doesn't really make any difference in this case. So those are the basic settings; this is how you use Tesseract. One more thing I want to show you: you can also use Tesseract to save into text files. We can say tesseract text.jpg text, and this is going to save the result into text.txt; we can open that up and see the result in a text file. So this is how you use Tesseract in the command line.

All right, now let's go ahead and do all of this in Python. First of all we import pytesseract, which we installed before, we also import PIL.Image, and we import cv2. If you don't have these libraries, you need to install them via the command line: pip install pillow and opencv-python. Those are the two libraries. What we do now is start by creating our config: we say my_config equals an r-string containing --psm and --oem, and now we can choose some numbers. For the first image we're going to work with text.jpg, and for this we assume a single uniform block of text, so we choose 6 as the PSM and go with the default OEM of 3, based on what is available. That is our config, and now we can use a simple function called image_to_string to get the text. We say text equals pytesseract.image_to_string, we pass PIL.Image.open with text.jpg, and as config we pass my_config, and then we just print the text. That is enough, and as a result you can see we get the text that is stored in the image, that is part of the image. So that's a very simple thing, and for text.jpg it obviously works very well.
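Put together, that first Python example looks roughly like this (a sketch of the code described above, using the text.jpg sample image):

```python
import pytesseract
from PIL import Image

# --psm 6: assume a single uniform block of text; --oem 3: default engine.
my_config = r"--psm 6 --oem 3"

text = pytesseract.image_to_string(Image.open("text.jpg"), config=my_config)
print(text)
```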
Now we can go with logos.jpg. For this we should probably change the config, because it's not a single block of text — it still recognizes some of the names as it is, but which mode would be the best? Let's see: fully automatic page segmentation, a single column of text of variable sizes, sparse text... We already tried 11 in the command line; I think 11 is good, 12 is good, and I think the automatic modes are good as well. So let's try something like 3. I'm not sure 3 is going to work well, but we can try it, and we're probably going to get at least better results than before — not the best, but we got some pretty good ones: Microsoft, PayPal, Sony, Canon, Kenwood, Mobile, Boss, Oracle, Yahoo, Rolex. Let's try 2 — you can play around with this, just try and see which mode gives you the best results for what you're trying to do — and here we have a problem, because this mode is apparently not supported. So let's try 11 again; we already did this, but there you go, we can see a lot of text.

All right, let's try another image: the signs. Here we basically have street signs with things like "Village Centre", "Width Limit", "Blandford", "Spetisbury", and we want to see if our OCR is able to handle that. So we say signs.jpg, and for this we try to find as much text as possible, the same settings as before. We run it, and what do we get? We get Centre, Upton, Village, Spetisbury, Blandford, a bunch of different things — but we don't get everything, because for example we didn't get "Width", and I think we also didn't get "Limit". Again, we can play around with the different modes, but this is how you extract basic strings from images.

Now, what we can also do is plot rectangles around the recognized characters, so we can see which characters are recognized by the OCR and which are not. For this we have to change the approach a little bit. We say image equals cv2.imread — so we use OpenCV instead of Pillow, because we're going to draw the rectangles with OpenCV — and we pass the file name; let's start with a simple one, text.jpg. Then we extract the shape: height, width, channels equals image.shape. I'm not even sure we're going to need the channels, so let's turn that into an underscore, because I think it's just a return value we don't need. So we get the height and the width of the image, and then we convert the image to boxes using pytesseract: boxes equals pytesseract.image_to_boxes, passing the image and the config my_config. Let's look at the result to see what we're actually working with. We run this, and what you see is that we get these individual boxes — quite a lot of them — one per character, each with coordinates. Those coordinates are basically the rectangle coordinates, the two opposite corners; I don't know what the last value is, to be honest, so we're going to ignore it, we don't really need it. What we do now is say: for box in boxes.splitlines() — so we get each character as a separate box — then box equals box.split() to split on whitespace, and then image equals cv2.rectangle, where we pass the image and then the two corner points of the rectangle.
Point 1 is the upper-left corner and point 2 is the lower-right corner, and that's going to draw the rectangle. Here we need int(box[1]) — not box[0], because box[0] is the actual letter — and then height minus int(box[2]), because the height axis is inverted, and then int(box[3]) and height minus int(box[4]). We also need to pass a color; keep in mind that we're working with a BGR color scheme in OpenCV, not RGB, so it's not red-green-blue but blue-green-red. If you pass (255, 0, 0) you get blue and not red; we're going to pass (0, 255, 0), where it doesn't really matter because green is in the middle. Then we pass 2 at the end, and we display the result with cv2.imshow("image", image) — the string is just a window title, it doesn't matter how you name it — followed by cv2.waitKey with a delay of zero. If we run this, we should be able to see the boxes, and there you go: you can see the image and the boxes around the recognized characters. I think this mark here is not actually a character, is it? Let's see where that is in the text... ah, it's an "i", the box is just plotted above the characters. So this works very easily for this particular image.

Now let's apply it to the logos to see what it recognizes there. We're going to see that it's not as good as on the other image, but it still recognizes quite a lot. This here is recognized as two characters, so we see the b and the c but not really the i; this one is recognized as a character; Mastercard is recognized as two characters, Samsung as one character, and maybe a lambda or an s in here. We can see that Sony works, PayPal works, Microsoft and Bosch work, Yahoo works, and so on — so it partly works and partly not.

Now let's go to the actual signs, and we're going to see that there it doesn't work very well: it skips a lot, or it finds a lot that is not actually a character. These two things are recognized, this one is recognized, I don't understand why this one is not, those two words here are also not recognized — but a bunch of leaves and trees are recognized as characters. So this is not optimal.
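Written out, the character-box code described above looks roughly like this (a sketch; swap the file name and the --psm value to reproduce the different experiments). image_to_boxes returns one line per character in the form "char x1 y1 x2 y2 page", with the origin in the bottom-left corner, which is why the y values are subtracted from the image height:

```python
import cv2
import pytesseract

my_config = r"--psm 11 --oem 3"

img = cv2.imread("text.jpg")   # try logos.jpg or signs.jpg as well
height, width, _ = img.shape   # the third value (channels) is not needed here

boxes = pytesseract.image_to_boxes(img, config=my_config)

for box in boxes.splitlines():
    box = box.split(" ")
    # box = [character, x1, y1, x2, y2, page]; y is measured from the bottom edge
    img = cv2.rectangle(
        img,
        (int(box[1]), height - int(box[2])),
        (int(box[3]), height - int(box[4])),
        (0, 255, 0),  # BGR, so this is green
        2,
    )

cv2.imshow("img", img)
cv2.waitKey(0)
```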
Of course, going back to the logos, you can also see how the results change if you change the mode. Let's change 11 to something like 2 or 3 — for example 3. We're going to see that the results are quite different... well, maybe not that different between 3 and 11; actually, this looks a little bit better, this actually looks way better than just using 11. But if we now assume a uniform block of text, so let's go with 6, that shouldn't work very well with the logos, because it's obviously not a uniform block of text — yet it actually still works quite well. So let's pick the worst option and say it's a single word: let's change this to 8, and then we're probably going to see very bad results. Yeah, there you go: if it thinks that all of this is a single word, you can see that the boxes are not very well placed either. So let's go back to 11.

What we can do now as well is plot boxes not only around individual characters, but around individual words. Obviously, if we look at the logos, certain characters belong together: eBay should not be recognized as e, b, a and y, it should be recognized as one block, and thus as "ebay". Because of that we want to plot boxes around the individual words, not just the individual characters. For this we're going to remove all of the previous drawing code; I think we can keep the cv2.imread part, that should still work. Then we use a different function, image_to_data: data is going to be pytesseract.image_to_data, we pass the image, we pass the config, and we pass an output type. For that we have to import something: from pytesseract import Output, and then we choose Output-dot-something — you can see we can choose DATAFRAME, which I think is a pandas data frame, or DICT, BYTES, STRING, whatever. We're going to go with a dictionary for today's video. Then we print the data to see what we have in there — actually, let's comment this out so that nothing else gets in the way — and you can see we have "level" and a bunch of values and then a bunch more stuff. Maybe we should print the keys to see what we can actually extract: we have level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf and text. conf is the confidence and text is the actual text, and I think those are the things we need for plotting the boxes. So let's see what we have in data["text"], so that we see how this actually works — and there you go, you see all the recognized words.

Okay, that's interesting. So what we do now is say amount_boxes is going to be len(data["text"]), and then: for i in range(amount_boxes). Inside the loop we say: if the float version of data["conf"][i] — so the confidence — is above a certain threshold, let's say 80, which means a confidence of 80 percent, then we plot a box; because if we set the confidence too low, it's just going to plot boxes everywhere. So if the confidence that this is a word is above 80, we say x, y, width and height — let's turn this into a tuple — are going to be data["left"][i], data["top"][i], data["width"][i] and data["height"][i], providing the index i for each of them of course.
So we're basically getting the top-left corner of the rectangle and its size, and that is enough to derive the rectangle itself. Because of that, we go into the next line and say image equals cv2.rectangle: we draw onto the image, the rectangle starts at (x, y), and the lower-right point is (x + width, y + height); and don't forget the BGR color scheme again, (0, 255, 0) for green, and then 2 again for the thickness. What we're also going to do is put the recognized text below the boxes, but we can run it without that first, so that you already see the boxes around the text. If we run this we should already see the result — at least if we didn't make any mistakes. Of course we did: "float argument must be a string or a number, not a list". I think we need to add the index i here... right, there you go. Now you can see the recognized words: PayPal, Sony, HP, eBay, Microsoft, Bosch, Oracle, Boss, Philips, CK, Canon and so on. But we don't see what the actual recognized text is — maybe it makes some mistakes, maybe it says CK is a word but reads something else.

So, in order to see what it actually does, we say cv2.putText onto the image. We put the text, which is data["text"] — we saw in the keys that we have this text key — and we pick the i-th one of course, and we place this text below the box, so at x and y plus height plus a little bit more, 20 for example. We choose the font cv2.FONT_HERSHEY_SIMPLEX, we pick a scale of 0.7, so a little bit smaller than the default, we pick (0, 255, 0) as the color, 2 as the thickness, and for the last parameter — I forgot what it is actually for, I think it's optional — I chose cv2.LINE_AA in my prepared code. That should be enough, and we can run this. There you can see what it actually recognizes: Philips, CK, Bosch, Microsoft, PayPal, Sony, eBay; here it recognized something I can't quite read, then Canon, Mobile, a question mark for Burger King, and "GW" for CNN, which I can't really understand — it could also be a W. So you can see now what it actually does. In case I was blocking the code earlier: this is the code, we're basically plotting a rectangle and putting the text below the box.

Let's swap the image and go with the signs to see if it works there as well. We run this and you see Blandford, Spetisbury, Upton, Village, Centre, and then it says "LA" in the trees here — I mean, it could be "LA" if you have a good imagination. Of course we can also lower the confidence, so we can say: if you are more than 20 percent confident, then show me more, and maybe we'll get more text based on that. You can see we get a lot more misclassifications, but not really anything useful: we don't get any of the missing words just because we lowered the confidence, but we do get a "HAY" up here in the tree, so that's not really intelligent in this case. We can go to the logos, though, and see if that changes anything. If we go to logos.jpg and run this again with a lower confidence threshold, we can see way more — but not necessarily correct — stuff: Adidas is "a s", then we have YouTube as "wid", so it's not really a good thing; but NVIDIA was recognized now, and I think NVIDIA was not recognized with the threshold of 80.
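And the word-level version, as a sketch of the code described above (image_to_data with Output.DICT returns parallel lists, one entry per detected element; 80 is the confidence threshold used in the video):

```python
import cv2
import pytesseract
from pytesseract import Output

my_config = r"--psm 11 --oem 3"

img = cv2.imread("logos.jpg")  # or signs.jpg

data = pytesseract.image_to_data(img, config=my_config, output_type=Output.DICT)

amount_boxes = len(data["text"])
for i in range(amount_boxes):
    # conf is -1 for non-word entries; only draw reasonably confident words.
    if float(data["conf"][i]) > 80:
        (x, y, w, h) = (data["left"][i], data["top"][i],
                        data["width"][i], data["height"][i])
        img = cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
        img = cv2.putText(img, data["text"][i], (x, y + h + 20),
                          cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 0), 2,
                          cv2.LINE_AA)

cv2.imshow("img", img)
cv2.waitKey(0)
```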
But yeah, you can play around with the settings here; this is how you do OCR in Python. So that's it for today's video. I hope you enjoyed it and learned something; if so, let me know by hitting the like button and leaving a comment in the comment section down below, and of course don't forget to subscribe to this channel and hit the notification bell to not miss a single future video for free. Other than that, thank you very much for watching, see you in the next video, and bye.
Info
Channel: NeuralNine
Views: 189,068
Keywords: ocr, optical character recognition, text recognition, python ocr, tesseract, pytesseract, python tesseract, computer vision, image to text, python, extract text from image
Id: PY_N1XdFp4w
Length: 29min 23sec (1763 seconds)
Published: Thu Oct 07 2021