How to Preprocess Images for Text OCR in Python (OCR in Python Tutorials 02.02)

Captions
Hello, and welcome back to the series on Python and performing OCR in Python. This tutorial series is geared towards anyone with a general understanding of Python who wants to perform OCR in it. In the last video I introduced you to a fairly simple library for loading images into memory, and that was Pillow. Pillow is a fork of PIL, the Python Imaging Library. What we're going to cover in this video, however, is a different library called OpenCV, which stands for Open Source Computer Vision. Computer vision is a complex task that requires you to manipulate images or video feeds in complex ways. The benefit of a library that can achieve that is that you can do really complex image manipulation in Python. The downside, as with any other programming language or software out there, is that the ability to perform really robust tasks means that doing those tasks in code is necessarily a bit more complicated, and it's no different with OpenCV.

What I'm going to do in this video is walk through all of the key image processing steps in OpenCV that get you better results with pytesseract, or Tesseract in general. These image manipulation steps are usually something I would cover in individual smaller videos, but because I want you all to use this video as a one-stop shop, I'm making an exception here. This is going to be an exceptionally long video, but I'm going to have everything timestamped down below in the description so that you can skip to the task that you need a refresher on. I encourage you, for the first pass of this video, to sit back, grab a cup of coffee, and just watch, because I'm going to be going over nine different main ways you need to manipulate an image, or could need to manipulate an image, in Python to have better OCR results with Tesseract.

The first thing we're going to talk about is inverting images, and why you might not want to do that in Tesseract 4. We're also going to
talk about rescaling, which is when you adjust the DPI of an image, and which isn't going to be necessary for some images. We're going to talk about binarization, the conversion of an image to black and white. We're going to talk about noise removal, which is what it sounds like: the removal of noise from an image, since noise can oftentimes produce poor OCR results. We're going to talk about dilation and erosion, and you'll see what that is. This is for when you have bleeding of text in old historical documents, and it's a way to eliminate that, which will improve OCR results. We're going to talk about rotation and deskewing, which is oftentimes needed after a bad scan when the page is at an angle. We're going to talk about removing borders. This is one of the best steps you can do; oftentimes you can have a very clean page and a single black border will ruin everything for you. We're going to talk about how to remove those borders, and then also the inverse problem, missing borders: how to add borders in so that you can have better results. And then we're going to talk about some unique cases where transparency and alpha channels might affect your OCR results.

I'm going to cover each of these in this Jupyter notebook, which I'm going to put onto GitHub so that you can go in and use it as a cheat sheet, copy and paste things from it, and use them in your own scripts. But I encourage you to sit back and watch all of these steps, because while not all of them will be relevant to you right now, they will be at some point in the future. At some point you're going to be working with a document that was very badly scanned, and you'll need to rotate it and deskew it. At some point you're going to have a scan that's got borders that you're going to have to remove. You're going to have to know how to do all these things to solve different OCR problems during the course of your career. So you're not going to have to do all of these
for a single task, but you're going to have to do all of them at some point for different tasks. That's what we'll cover in this video moving forward.

But before we get to that, the first thing we have to talk about, and this is why it's numbered 00, is how to open an image in Python with OpenCV. The way in which you open an image file in Python with OpenCV is this. We're going to create a variable; let's call it image_file. This is going to be the location of our file, and we're going to be referring back to image_file throughout our script. We know that our image is in data and it's page_01.jpg, so it's a JPEG. Then what we need to do is load that image file into memory. Now, there are a bunch of different ways to name things here in Python. Some people will say im, other people will say img, others will spell out image, and the problem is that OpenCV's documentation is oftentimes neglected. Speaking of which, you need to actually import OpenCV. If you look at OpenCV's documentation, you'll notice that they say import cv2 as cv. Let me zoom in a little bit for you. The problem with this is that the community has not done this. So I'm going to do what the community does, because it'll make your life a lot easier, and we're going to import cv2; we're not going to import it as cv. Now, I personally think that one should follow the documentation. I am making an exception here because anytime you try to look up a problem with OpenCV on Stack Overflow, everyone's going to be doing import cv2, or at least a lot of them.

Once you've got that imported, it comes time to load the image into memory. We're going to do this by saying cv2.imread, so image read, and we're going to pass in one argument right now. There are multiple parameters that you can pass here; we're going to pass in our image file, and
when you do that you've loaded it into memory. And I got a syntax error because I forgot to hit equals. There we go.

Now, if I'm working inside of Atom, I can show an image in a very specific way. I can say cv2.imshow, which stands for image show. I'm going to pass in two arguments. Think of the first as a title, so "original image", and the next thing I pass in is the image itself, img. Then I can say cv2.waitKey, with a capital K, and this is going to make it wait. I can specify in milliseconds how long to wait, so 100 would be 100 milliseconds, or I can pass zero, which will wait indefinitely for a key press. If I execute this, you'll see that I have an error: cv2 is not defined. Let's go ahead and run the import. There we go. And then I can open it up. Because I am using JupyterLab, you'll see that I can actually open up the image. If you're using Jupyter Notebook, not JupyterLab, your Python might crash, and if that happens you're going to have to do something like cv2.destroyAllWindows. There are a couple of little workarounds.

What I'm going to do, however, because I want to display inside of the Jupyter notebook, is display the image inline, meaning I want to be able to see it all the way through. I'm going to use this function that I found on Stack Overflow, and I'm going to include where I got it from right here. So this is the link, and this is a handy way of using matplotlib to show what's happening inline from OpenCV. I've added this little bit up here: from matplotlib import pyplot as plt. So what I can do now is execute this cell, and I've loaded that function into memory. What this function is going to do is allow me to maintain the original image size as I use matplotlib to display it. Now I can just use one simple function to
display my image, and this function is called display. So I can say display, and this is going to take the path, and my path is image_file. I can run that, and then the image is displayed inline. Now I have the full, proper resolution of the image. It's a bit large for this video, but I think it's important, because if I didn't do this and were to just use matplotlib, it'd be a very small image and it would be difficult to decipher the actual characters, the words and letters, in it. So this is the original image that we're going to be working with throughout this script.

What we're going to be doing throughout the rest of the script is following these nine steps: invert the image, rescale the image, binarize, remove noise, and so on down the list. In other words, we're going to be using OpenCV to manipulate this image. We're going to create functions to do all these different steps, and at the end of this video we're going to display all of these images side by side in a 3x3 grid with the original image up above. That's going to allow us to see how each of these steps affects our output a little differently. So we're going to be trying to get rid of borders, we're going to be trying to clean up some of this noise, we're going to invert it, and we're going to convert it to black and white, all in the next few steps.

Let's go ahead and start with inverted images. Our first task is going to be to invert an image. An inverted image is kind of what it sounds like: you take all the different pixels and make them the exact opposite of what they are. Things in the white spectrum will become black, and things in the black spectrum, so grays, dark grays, and blacks, will become their inverse counterpart in the white spectrum. This used to be a great way to preprocess an image for OCR with Tesseract 3.x, so Tesseract 3.0. If you're using Tesseract 4.0, this will actually return poor results. The goal of preprocessing is to get the image into the optimal shape so that it can be processed the way the training data was processed. The kinds of images the system learned on are the kinds of images you want to hand to the OCR system. Even though this is irrelevant for Tesseract 4.0, I still want to show you how to do it, because it's an important step in certain cleaning methods.

So how do you do it? Fortunately there is a built-in way to do this with OpenCV. We're going to create a new object, we're going to call it inverted_image, and we're going to make that equal to cv2.bitwise, and we're going to pass in img, our image, which, if you remember, we loaded in up above, right here. Let's scroll back down. Then we're also going to save that file into our temp folder so that we can display it with our display function from up above. We're going to say cv2.imwrite, which is going to allow us to write that file somewhere in our directory, and we're going to say where we want to write it to, so temp/inverted.jpg, a JPEG. Then we say what to save: our inverted image. And if we do that, we see that we've got an attribute error, because I forgot to write bitwise_not. There we go, and it comes back as True, so we've now officially saved that file. Now let's use our display function to display it. We're going to pass in temp/inverted.jpg and run this, and now we have an inverted image of that original file. Our original is up here, and we see that the inversion does exactly what I said it would do: all the lighter areas are now dark, and all the darker areas are now light. This
however is not an essential step in Tesseract 4.0, but it's an important skill to understand how to implement.

The next thing I want to look at is something known as rescaling. I'm not going to do a lot with rescaling right here, because it's really difficult to do well and it's not going to be entirely necessary. Essentially, there is an optimal range for your image file to be in for optimized OCR, and that range is defined by the height of the characters, which comes down to the DPI. This is in the Tesseract OCR documentation. I'm going to address this in a later video, because it requires a bit more knowledge of OpenCV to do well and we're not there yet with this series. I am, however, going to fill this in when I put the Jupyter notebook onto GitHub, so you'll be able to see it there. We're going to skip that just for right now and jump down to binarization.

Binarization is the process in which you binarize, or convert, an image into black and white. In order for an image to be converted into black and white, it first needs to be in grayscale. So the first thing we need to do is create a function that's going to automatically convert our image into grayscale, so that we can save that gray file and then use it to do binarization. Let's tackle the first problem: a function to get our image into grayscale. We're going to call this grayscale. It's going to take a single argument, and that's going to be the image itself. Then we're just going to return cv2.cvtColor, with a capital C. This takes two arguments: the image, and then how we want that image to be converted. This is the part of OpenCV syntax that's going to look a little odd to you
if you're not familiar with it. We're going to say cv2 dot, and then in capital letters, COLOR_BGR2GRAY, with the number two. What that's going to do is return a grayscale image for us. Let's go ahead and save that function there, and now let's create that gray image. We're going to create an object called gray_image and make it equal to grayscale(img), so we're passing that img object into it. Then I want to say cv2.imwrite, so we're going to write an image again into our temp folder. This is going to be our grayscale image; we're going to call it gray.jpg and make sure that what we write is gray_image. If we execute this, everything works great. We got True, which means our image is now saved. Let's go ahead and display that image: we're going to display temp/gray.jpg and see what it looks like. This is now a grayscaled version of our file. If you saw it side by side with the original it wouldn't look too different, but you'll notice that the hues of tan, the beige colors, are now grayed out. This is what the grayscale image looks like.

Now that we have a grayscaled image, it's time to actually binarize that image and convert it into strictly black and white. The reason why you want to convert to grayscale first is that it allows this process to be done a lot more easily. What we're going to do now is say thresh, im_bw. These stand for threshold and image black and white; this is the Pythonic way to do it. We're going to make that equal to cv2.threshold, which is how we apply a threshold to an image in OpenCV, and we're going to pass in gray_image, because the first argument needs to be the image itself. Then we're going to have two integers. I'm going to explain these
integers in a lot more detail later on. For right now, just understand that the first is the threshold value and the second is the maximum value, the value given to pixels above the threshold. Oftentimes you're going to start off with 127 and 255 here. I'm going to explain why, with text, these aren't always the right parameters, and I'm going to be adjusting them in just a second; this is just to demonstrate what's going on. The next thing we're going to say is cv2.THRESH_BINARY, all capitals. There are several different threshold types here, and I'm not going to address all of them right now, because we're just trying to get the basics of this process down.

Once we've done that, we can say cv2.imwrite, and we're going to write that black and white image, im_bw, down. To do this we're going to say temp/bw_image.jpg, so the black and white image, and then we're going to say that we want the contents to be im_bw. Let's go ahead and run that and make sure everything comes out correctly, and it does. Now it comes time to display it, so we're going to display temp/bw_image.jpg, and what we see now is a black and white image. I want you to take a moment and look at this. Does this look like a good result? The answer is no. What's interesting is that while this is more difficult for us humans to read, certain things might be easier for a machine to read. But right now we don't have the correct parameters here. Instead you want to adjust them to somewhere around, for this image, let's say 200 and 230, and rerun these two cells, and now you get something that looks like this. What you have here is a much better result. In fact, you're probably going to play around with these numbers a little bit more, but what you're getting now is an image that does not have a lot of the noise that you saw in
the original image up here. All these little pixels of different hues throw off OCR radically. Now, with this particular image we could probably run it through pytesseract without any kind of manipulation and have fairly good results, but we're not after fairly good; we're after great results. By doing these kinds of preprocessing steps on our image file, we are going to have much better results from the OCR process. That's because we're able to contrast the actual script, the actual characters on the page, much better: the background, all those different beiges and darker shades, is now one color, bright white, and the same thing has happened with the pixels that represent the characters on the page, which are now a dark black. This is going to allow for much better OCR. I cannot stress to you enough that this right here, the conversion to gray and then playing with the threshold to create a binary image, is the most important of all the steps I'm showing you. You're going to play with these numbers a little bit to get something that works best for your image at hand.

But as we're going to see, there are times when different methods are going to be necessary. What we're going to be looking at over the next few minutes is what to do about images that have a lot of noise in the background. How do you remove that? That's what we'll look at now. What we have now is a fairly good image, but one of the things you might immediately notice if you look at it, even if you don't know the word for it, is that there still remains some noise. What is noise? Noise is anything like this: pixels that do not correspond to text but that still surround certain text items. So this would be
considered noise: down in the bottom right-hand corner, these pixels surrounding the number 1. Where you're really going to have a problem is with things like this. Now, truth be told, this image in this state is good enough to get state-of-the-art OCR results, and when I say state of the art I mean accuracy in the 98 percent range. This is good enough. However, we can do better, and the reason why I'm going to show you how to do this with this document, even though it's in a good enough state, is that not all documents are created the same. Some documents are going to have a lot of noise around them, and what I'm going to show you is how to get rid of things like these dots around the words "internal" and "affairs", these little dots down here. We're going to do this using a method that we'll see in a lot more detail in just a second: we're going to use dilation and erosion to do noise removal.

We're going to create a new function; let's call it noise_removal. It's going to take one argument, the image that we pass to it. Within this function we're going to have to import numpy, so we're going to import numpy as np. NumPy is a way of working more efficiently with numerical data in memory because of how it's stored. I'm not going to get into it in this video; just understand that it is a central component of most machine learning and most robust libraries in Python. We're going to have an object called kernel, and this is going to be equal to np.ones. The first argument we pass in is the size of the kernel; think of it as the shape of the window we pass over the image to capture noise. The second argument is going to be np.uint8. There we go. Now we need to start adjusting the image based on these kernels, so we're going to say image
is going to be equal to cv2.dilate. This takes a few different parameters. The first is the image that we're passing to it. The next is the size of the dilation that we want to see, and that's going to be our kernel. The next is a keyword argument called iterations, and we're going to make that equal to 1, so just one pass over the image for dilation. Next we need another kernel, really (actually, let me fix this; there we go, that way it's Pythonic). The other thing you want to have is another kernel that will control the erosion. In my case right now you're going to want to keep these the same, because they're going to be doing the exact same thing, but for certain problems you're going to want them to be different sizes for different tasks. Just for right now we're going to repeat all the same things; the second parameter here is again going to be np.uint8. Now we can modify the image once again: image equals cv2.erode this time. The erode function takes the image, just like we saw before, it takes the kernel, and then again we're going to have iterations equal to 1, and that's going to erode the image a little bit for us.

The next thing we need to do is morph the image, so we're going to say image is equal to cv2.morphologyEx, with a capital E and a lowercase x. We pass in image, then our second argument is going to be cv2.MORPH_CLOSE, all caps here, very important, and next we pass kernel to capture that size once again. Then we're going to say image is equal to cv2 dot, and we're going to blur it a little bit, so medianBlur, and
again, think of all these things like the different steps you might do in Photoshop or some other image processing program. It takes the image, and the second argument is going to be 3. Again, these control the shapes of things, here the size of the blur. Then we're going to return the image. So these are our steps for noise removal.

Now, in order to actually call that noise removal function and see the result, we need to write its output into our temp folder. Let's say no_noise, our new object, is going to be equal to noise_removal. Here we're not going to pass img; we're going to grab the black and white image that we created up above, so that's im_bw. Then we're going to write it with cv2.imwrite into our temp folder once again. We're going to call the file no_noise.jpg, and the image to write is no_noise. Let's go ahead and load that function into memory and execute that, and we've got an error: cv2 morph. What did I do wrong? I think I spelled morphology wrong. Let's see. There we go, now we should be good. We get True back, so everything's worked out great.

Now let's go ahead and display this file. We're going to display temp/no_noise.jpg and see what it looks like. This is now what our image looks like. I want you to take particular note of all of the noise that was down here: the 1 now appears without any kind of stuff around it. If we scroll back up, you can see that the 1 was clouded with a whole bunch of noise. Like I said, for this particular problem the 1 would probably have been captured perfectly without issue, and the same goes for "internal" and "affairs". But let's look for a second at what we've done. All of that noise surrounding "internal" and "affairs" is now gone. However, it came at a cost, and that cost was that our
characters are now a little bit bolder. If we look back up here, you can see that "internal" is a little bit more legible in the earlier version. In this instance I probably would not have done noise removal. I would have done a test by passing this image to pytesseract, and if it came back perfectly fine, I wouldn't have done noise removal. If, however, this noise was throwing off pytesseract, I would then move to the noise removal step to get rid of it, because a slightly bolder result is better than an illegible one.

Okay, so what do we do about texts that have really thick letters that might throw off OCR results, or maybe really thin letters that might throw off OCR results? The answer comes down to the two things that we just saw a second ago and that we're now going to dive into a little more: dilation and erosion in OpenCV. Let's scroll down and start working with them. You want to use dilation and erosion on their own, without morphologyEx and the median blur, when you don't have noise in the background. The morphologyEx call and the median blur are the main things that get rid of the noise. When you don't have noise in the background but you've got a font that looks a little too thick or a little too thin, you can use the dilate and erode functions in OpenCV to adjust the thickness of the font.

Let's go ahead and look at a function to make our font thinner. What you're going to see in this process is that we're actually going to make things worse. I would consider this font at this stage to be fine; it'll work perfectly fine in pytesseract right off the bat. But let's experiment with thinning the font and thickening the font so that you're able to do that with a problem that might require it. First we're going to create a function, and we're going to call it thin_font. Not a very clever name, but one that'll work for us just
fine. This is going to take, again, just one argument, the image itself, and we're going to import numpy as np. At this point I should probably just have that at the top of the script. Now, in order for dilation and erosion to make sense, you have to know that they're kind of backwards, and the reason why they're backwards is that they're meant to handle an image where the background is black and the text is white. So in order for the dilation and erosion to make sense, we need to invert our image. I showed inverting an image up above: essentially you use cv2.bitwise_not on the image, and that's going to make our image the exact opposite. We're also going to use the same kind of kernel that we saw up above, except this time we're going to make the kernel size (2, 2). And then, and this is going to be the new part, we're going to say image is equal to cv2.erode, so we're going to erode the image. This is going to thin the text; erosion is the thinning down of pixels. We pass in image and kernel, and then iterations; we're just going to do one here. Finally, we're going to invert the image once again to revert it back, so we say image is equal to cv2.bitwise_not(image), and that's going to revert it back to white being the background and black being the font. Then we return image, and once we have that function loaded, we see that it works just fine.

It's time to make our eroded image. We're going to make this equal to thin_font, and let's pass in our no-noise image;
we're going to pass in this image right here, the one that we just removed the noise from. Then we say cv2.imwrite, and we're going to write the eroded image to temp/eroded_image.jpg. Let's run that cell. We see True, fantastic. Now let's display it: we say display temp/eroded_image.jpg, execute that, and you now see that the font is a lot thinner. I can up the kernel size to three, and it's going to start becoming a little illegible and, I would argue, not good. So let's change that parameter, rerun these cells, why not, and we see that we have something that looks like this. This is not good. You could also play around with the number of iterations. Let's make it 10 iterations, so 10 passes over the image, and you'll see something that looks like this, where it's just a completely white page, because it's gone through that same process ten times. But you get the idea. You can play around with these numbers, and every single OCR problem is a little bit different.

The other thing that we can do, and let's set iterations back to one and adjust the kernel back down, is thicken our text. Let's create a few new cells here and copy and paste this function down, because we're really only going to be changing one thing in it. We're going to call this function thick_font. It's going to do all the same things, except here we're going to say dilate. Dilation is the expansion of pixels, so in this scenario we're going to expand the pixels. Let's make a dilated_image, and that's going to be equal to thick_font, and we're going to pass in the same no_noise image. Then we're going to say cv2.imwrite, and
we're going to write this as temp/dilated_image.jpg, passing in dilated_image, and then we're going to execute that cell. True, great; it looks like that function works just fine. Now we need to display it, so we're going to say display temp/dilated_image.jpg, and when we execute this cell we'll see a thickened font. In some cases this is going to be an essential step. If you notice, this is going to be particularly useful for problems like we saw here with the 1944: you can see how the 44 there doesn't look as clear as the 44 down here. So in some instances you can see why the thickening of a font might be essential. Let's go back up to this; you'll see it's looking a little bit better than it did before, just a little bit, not a lot, again because this is not an image that requires a lot of these processes. You're going to find that when you're working with a large PDF that has a large set of similar documents, you only have to find the sweet spot one or two times and adjust accordingly. So this is how you solve the problem of a font that is too thin or too thick, using erosion and dilation. It is important to remember that these processes will work the opposite way round (erosion will thicken dark text and dilation will thin it) unless you invert the image at the beginning of the function and re-invert it at the end. The next thing we're going to talk about is rotation and deskewing, and for this I'm going to have to use a different image, because this one honestly is not bad enough that I could justify using it. You're going to want to use rotation and deskewing when you're dealing with a PDF or an image file that is pivoted sideways. In fact, one of the things I might do is just load in this image skewed so that we can then deskew it. Okay, now that we have gone through these different methods for expanding or contracting a font with dilation and erosion, it's time to show you how to do some really
important stuff with a lot of bad images of text. One of the common problems that you'll see with images of text is pictures that look like this. Now, this is the exact same file; all I've done is rotate it. And this is important to note here: for this method to work, I've already eliminated the borders. I'm going to show you how to identify and delete borders in the next segment of this video. For right now, however, we're going to focus on just how to handle rotated text. The reason why rotated text will not be picked up comes down to one very important thing: how OCR works. OCR is designed to work with vertically aligned text, meaning it expects the images to look like this, with the text straight up. One of the things you're going to have to do as part of your preprocessing is frequently take in images that look like this and correct them. Now, I'm going to be copying and pasting in some code here from a website. I did not write the code that you're going to see in just a second. I'm going to copy and paste the functions in, but I'm going to explain essentially how the code works, why it does what it does, and why it's an important step. So first and foremost, let's load in our actual image. We need to load in page_01_rotated.jpg. The way we're going to do that is we're going to create a new image; let's just call it new, we're going to keep this very simple, and set it equal to cv2.imread, the same syntax that we saw up above. The file is located in data/, and it's page_01_rotated.jpg. Let's go ahead and load that in. It's now loaded into memory, fantastic. Now I'm going to copy and paste this code in, and I'll go through real quickly and explain what's happening and why it's important. So yeah, a lot of code to go through. What we're doing is importing numpy, the same thing that you've seen
before. As for what this function does, I'll give you the source for it right here as well: it's from becominghuman.ai. From what I have seen across the internet, this solution is fairly common, and, let's scroll back up, it's used by a lot of different people in a lot of different places. I don't know if this is the original source for this code, but this is where I initially found it, and that's why I'm providing it. You will see it in other places. What this does is go through and process the image: it grays it, it blurs it, it adds a threshold, and it adds a really large blurring effect to it so that you can identify contours. Contours are going to allow you to draw bounding boxes. I'm not going to talk about bounding boxes in this video; I will, however, give you a quick little glimpse at what a bounding box looks like on an image. So this is the image, and this is the way bounding boxes work. If you notice, the bounding boxes are at an angle, so they're not actually able to capture text in a horizontal way. What we're going to do is use things like that in this function to make a prediction about how the text ought to be laid out. That's essentially what this function does: it captures the angle of the text and makes an automatic guess about how best to adjust the image to get the correct rotation. And again, it's very important for this that you do not have a border. If you have a border you will have problems; removing borders should occur first. So now that we've got this function and everything loaded into memory, it's time to actually use the function. What's nice and handy about this is that we can do a lot of very complex things with, let me add one more function, and there we go, we can do a lot of very complex things with really just one line of code now. We can call this new object fixed, and we're going to make that equal to deskew, so we're going to call in this new deskew function, the deskew
function, which is going to call these other functions up here and essentially output an image for us. It's going to take in new, the image that we loaded in up above, the one that is skewed. Then we're going to say cv2.imwrite, and we're going to write that fixed image to our temp folder. Let's call it temp/rotated_fixed.jpg, I think that's what I've called it, and we need to pass one more argument, which is going to be the object itself. That looks good. What we're going to do now is display temp/rotated_fixed.jpg. And now look, we've taken that rotated image; in fact, let's go ahead and display the rotated image up here so you can see them right next to each other. We're going to display data/page_01_rotated.jpg. There we are, let's load that in. So it looks like this initially, right? Definitely skewed, definitely slanted. What the function has done is take it and twist it around for us based on the bounding boxes, based on the contours it was able to identify. I cannot stress enough how important this function is for having better OCR results. You'll be very happy with it; it will save you lots of time scouring the internet for a solution to a deskewing problem when you can just copy and paste this code in. Again, I've given you the source if you want to explore the blog on this a little bit more, but you'll find it in multiple places across the internet. Now, if you look really closely you'll see something that's happening here: it's blurring. If your text, for example, were cropped right there on the side, that little letter would be skewed a little bit. So you want to make sure that if you're dealing with text that is butting up against a border, you also add in a
little bit of a border around your picture, or around your image, as well. Again, we're going to be dealing with removing borders and missing borders right now. Okay, so now we're going to try to remove some borders. I'm going to be working with the no_noise image that we saw just above. If you don't remember it, I'm going to display that image right now by typing in display with temp/no_noise.jpg, and we're able to see what this no_noise image looks like. If you notice, we have a very clear black background. Ignore this white bit right here; this is just the Jupyter Notebook's rendering of its own border. The actual border within the image itself is right here, and it's black. What we're going to do right now is try to eliminate that border. We're not going to be able to get all of it with this method. I'm going to show you another method in a later video, when I talk about bounding boxes in more depth, where we can achieve this in a more robust way. Right now, though, we're going to get rid of most of it. The reason why we can't get rid of all of it is this little bit over here on the right-hand side: the border is uneven, it's not a straight shot, and that's going to cause a little bit of this border to remain with this method. But let's jump right into it. What we're going to do is create a function here. Let's get creative and call it remove_borders. It's going to take one argument once again, the image that we're going to work with, which in this case is the no_noise image that we're going to pass to it. Now we're going to have two objects coming out of this one line of code, which is going to be cv2.findContours, with a capital C there, very important. We're going to pass in the image, and then cv2.RETR_EXTERNAL and cv2.CHAIN_APPROX_SIMPLE.
I'm not going to explain too much about what's happening here, because what we are dealing with is beyond the scope of this video; I'm just trying to give you the code necessary to achieve this task. The next thing we're going to do is create cntsSorted; this is going to be the contours sorted out. We're going to make this equal to sorted, and again we're going to be using something a little bit beyond the scope of this video, which is lambda. I'm not going to cover lambda right now, but essentially this is going to allow us to organize all of our contours, all of the little bounding boxes that we're going to make here, into a nice list sorted by area, and we're going to pass in x. Then what we need to do is grab the last item in that list, because the last item in that list is going to be the largest bounding box, which means it's going to be the bounding box that covers this main block of text. Everything else is going to be small little things, like maybe that bit there, maybe that bit there. I'm not entirely sure what it's going to grab, but I know that the largest thing is going to be all of this white area. Then I need x, y, width, and height, so x, y, w, and h, and we're going to make that equal to cv2.boundingRect, passing in that one contour, that one box, which is going to be, like I said, the largest box that we have. We're going to set crop equal to image, and we're going to crop the image down. This is how you crop an image: you're going to say y colon y plus, and I always have to cheat here; I've got a cheat sheet on the other screen, because I have a very hard time remembering the proper syntax for OpenCV. This is, in my opinion, not a very natural library. I have a hard time remembering the syntax, and there are, I'll provide a link in the description down below, some good cheat
sheets for these basic commands, but I oftentimes find myself going to Stack Overflow multiple times when I'm working on anything with OpenCV. So this is our function; it's going to return a cropped image, essentially. Now, if you're dealing with PDFs where the margins, the borders, are consistently the same on every side, do not use this method. If you already know the size, preprocess them in bulk with a PDF editor or image editor. Alternatively, if you know the numbers, such as width, height, x, and y, you can do that in OpenCV as well by passing in those arguments manually. What this method is doing is creating those bounding boxes and finding the borders automatically, which is particularly good if you're dealing with inconsistent borders across the images that you're trying to OCR. So we're going to make another object called no_borders, and this is where we call our function remove_borders and pass in that no_noise object. Then we're going to say cv2.imwrite and create this temp file like we always have done, temp/no_borders.jpg, and we're going to pass in the other argument, which is, you guessed it, our object. Then what we need to do is display temp/no_borders.jpg. Now, if I've written this function correctly... it looks like we've got an error here. findContours, oh, counters, contours, there we go. I might have another error: cv2 has... contoursArea? What have I done wrong? contourArea is what this should be, I apologize there. And then we have got something that's not defined. [Music] There we go, I forgot my s right there. And it looks like I have yet another error: it looks like I haven't done cv2.boundingRect. I apologize sincerely if you followed along with that. But as you can see, once you've achieved this, we have gotten rid of almost all of the borders, like I promised you.
There is a little bit still here because it was not a straight shot. There are other methods to get rid of the text around the borders, specifically to grab paragraphs. In this scenario, if you're trying to automate the process, I would not use this method; I would use bounding boxes to grab the main body of the text. I'm going to cover that in another video, because this one is strictly dealing with the preprocessing of the image as a whole to get better OCR results. So that's how you add, or sorry, remove a border. Now we need to talk about how to add in missing borders. Why should you do this? You should do this in certain circumstances, when your text is all the way up against the border, so when the image is cropped and the a is sitting right at the very edge. The reason why you need to do this is that the models were trained with some borders, so it's going to have a hard time identifying the characters at the side of the page unless there is a white border, or whatever the surface color is. So let's figure out real fast how to add in a missing border. To achieve this, we're going to be working with that image we just created without any borders at all. I want to start off by grabbing the color. The color that I'm going to specify is an RGB value, and I want this to be white, so I'm going to do 255, 255, 255. If you want to learn about RGB colors, I encourage you to explore them. Essentially it's a way of creating, I think, somewhere around 16 million different colors per pixel. This is how all your modern-day TVs work, with the exception of specialized monitors. The next thing we're going to do is specify the borders themselves. We're going to say top, bottom, left, right, and that's going to be equal to [150] * 4, a list of four 150s, so we're going to create a border that has a width of 150 on all sides. Now what we need to do is create image_with_border.
We're going to make image_with_border equal to cv2.copyMakeBorder; this is how you make a border in OpenCV. We're going to pass in that object with no borders that we created a second ago, then top, bottom, left, right, then cv2.BORDER_CONSTANT, and the value is going to be set to our color, which in this case is white. What we're going to do now is say cv2.imwrite, and we're going to pass in two objects. The first is going to be the path, temp/image_with_border.jpg, and then we're going to pass in image_with_border. Finally we're going to display temp/image_with_border.jpg. I'm regretting that name now, having had to type it out so much. So let's load all this in and run it, and we now have a white border that is much, much larger. So this is how you get rid of borders and this is how you add borders; both of these are going to be necessary steps in preprocessing to achieve better OCR results. Now, these are not all of the things that you're going to have to do. I'm not going to cover transparency and the alpha channel in this video because it would be a little tangential. If you want to know about it, leave me a comment or question in the comments down below. Essentially it deals with a problem you get with certain PNG files. But right now I think this should cover all of the basics of how to manipulate images en masse to get better OCR results. In the next video I'm going to introduce you to pytesseract and show you how to start taking these preprocessed images and running them through pytesseract. We're going to see what results each of these different images gives with OCR, and I'm also going to introduce you to the key values that you can pass to pytesseract to account for different types of text. That's all going to be in the next video, as we learn the ins and outs of pytesseract. All of this will set us up for part 3 of this
series, where we explore concrete, complex problems with OCR: handling text in different typefaces and type styles, handling text that is organized in columns, and handling text that is in tabular data, in cell format. We're going to be addressing all these concrete, narrow problems in part three of the series. That's going to be it for now, though. Thank you for listening. If you've enjoyed it, please like and subscribe down below. And as always, thank you to my Patreon supporters; if you enjoyed this channel and found this video useful, please consider subscribing and contributing via Patreon. You can find that in the link down below.
Info
Channel: Python Tutorials for Digital Humanities
Views: 130,734
Keywords: python, digital humanities, python for DH, dh, python tutorial, tutorial, python and the humanities, python for the digital humanities, digital history, Python and libraries, python tutorials, python tutorials for digital humanities, opencv tutorial, opencv for ocr, image preprocessing for ocr in python, opencv python, opencv python ocr, opencv and python for ocr, ocr and python, opencv and ocr, how to preprocess with opencv, image preprocessing with opencv, image text python
Id: ADV-AjAXHdc
Length: 53min 23sec (3203 seconds)
Published: Wed Apr 07 2021