Introduction to PyTesseract (OCR in Python Tutorials 02.03)

Captions

[Music] Hello, and welcome back to the series on performing OCR in Python. In the last video, which was quite long, I introduced you to a lot of different methods that you could, and really should, apply in image pre-processing to achieve better results with pytesseract. Now comes the time in the series to introduce you to the basics of pytesseract. This video will be much shorter. I will introduce you to the complexities of pytesseract as we move forward into the next part of the series and tackle narrow problems: everything from receipts, to single-column text, which we're going to do a little bit of in this video, to multi-column texts such as indices, and all the way up to tabular data.

So let's jump right into the basics of pytesseract and how to interact with the library. The first thing you're going to want to do is import pytesseract. If you haven't done so already, please watch my video on installing these libraries, because pytesseract requires you not only to install it but also to have Tesseract in the path. The next thing we need to do is say from PIL, all caps, import Image, with a capital I. Let's go ahead and execute our import cell.

Now we need to create a file that we can call back to as we move forward. We're going to call this our main image file, and this is going to be data backslash page01.jpg, that standard raw image that we initially worked with. Just so you can see what that looks like, let's pull it up over here real fast; this is all in the GitHub. It looked like this: unedited, it still has the border, and there are no alterations to the coloring whatsoever. The other thing that we're going to be working with is the no noise image. What you're going to see is the radical difference in our output from these two images.

So let's go back into this Jupyter notebook. Let's say we want to create the no noise image file as well, so we're going to say that that's temp, because that's in our temp folder, and that's going to be no underscore noise.jpg. Remember, we made all these in the last video. Let's go ahead and just create those objects.

Now we need to create an image in memory. We're going to be using the Pillow library for this, so we're going to say img is equal to Image, with a capital I, dot open, and let's test out our original image file. Now that we have that loaded in memory, let's try to create an OCR result, and here's where we're going to be using the pytesseract library. Let's call this ocr result as an object, and we're going to make this equal to pytesseract dot image to string. What we're doing is taking in the image and telling pytesseract to convert it into a string for us; in other words, we're telling it to OCR the actual image. And we're going to say that we want to OCR the image file.

Let's go ahead and do that, and let's print off ocr result and see what everything looks like. It doesn't look like anything. We have not done good OCR, and the reason is that the image has not been pre-processed. The color of this image, the beige-ish surface, and the light and very faint font here are not allowing pytesseract to actually produce results. So how do we resolve this? We resolve it with the steps that we did in the last video, through image pre-processing.

Now let's take a look at the results of our no noise image. This is what the no noise image looked like, in case you don't remember. It was very much a different image than our page 01.
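The steps described so far can be sketched as a short script. This is a minimal sketch rather than the notebook from the video: the helper name `ocr_file` is hypothetical, the exact file names (`data/page_01.jpg`, `temp/no_noise.jpg`) are assumptions based on how the paths are spoken, and it assumes that pytesseract, Pillow, and the Tesseract binary are installed and on the path.

```python
from pathlib import Path

# Paths as described in the video; the exact spelling is an assumption.
IMAGE_FILE = Path("data") / "page_01.jpg"      # raw, unedited scan
NO_NOISE_FILE = Path("temp") / "no_noise.jpg"  # pre-processed version


def ocr_file(path):
    """Open an image with Pillow and return the text Tesseract finds in it."""
    # Imported inside the function so the sketch can be loaded even where
    # the OCR stack is not installed; in a notebook these imports would
    # normally sit in the first cell.
    import pytesseract
    from PIL import Image

    with Image.open(path) as img:
        return pytesseract.image_to_string(img)


if __name__ == "__main__":
    for path in (IMAGE_FILE, NO_NOISE_FILE):
        if path.exists():
            print(f"--- {path} ---")
            print(ocr_file(path))
        else:
            print(f"{path} not found; skipping")
```

Running both files through the same function makes the comparison in the video easy to repeat: the only thing that changes between the two runs is the image that is passed in.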
This looks a lot better. It has some border still there, but we have text that bounces off the page even for the human eye, and it's going to result in much better OCR. So let's go ahead and zoom back in, change our image that's in memory to no noise, and rerun this. And voila, like magic, we have good OCR results.

Are they perfect? It looks pretty close to perfect. You'll get a couple of normal things that pop up, such as this weird colon that's appearing here. This is likely due to some kind of noise that wasn't eliminated; something's being caught over in this area. Perhaps it's this pixel right here that's doing it; I'm not entirely sure. The other thing that I notice immediately is that this I right here is being rendered in our output as a J. This is quite common. These are known as the dragons of OCR.

OCR with machine learning is never going to be perfect. If you're in the 98 percentile range, that's considered good enough. The reason you're going to use OCR isn't necessarily to have flawless results; it's to OCR and get raw text for a giant corpus that would be impossible to transcribe by hand, or not worth your time, and you're accepting that two percent trade-off of error. This is normal, and the percentage of error that you're willing to accept is going to be a little bit different depending on the source material and the quality of the images.

That's going to be it for this video. Hopefully you have a good sense of not only how to interact with pytesseract on a very basic level, but also a good understanding of why all the steps that I introduced you to in the last video were important. Remember, the no noise image was the result of several different layers of pre-processing. Moving forward in the series, we're going to be shifting over to part three and solving concrete examples that oftentimes appear and that you have to know how to solve. We're going to be doing that by using pytesseract's custom config settings to adjust the parameters of Tesseract.

So if you've enjoyed this video, if you've gotten something out of it, please like and subscribe down below. And as always, thank you to my patrons; I now have four, and I'm very happy about that.
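The "98 percent" rule of thumb mentioned in the captions can be checked roughly by comparing OCR output against a short hand-typed reference. The sketch below is an illustration, not something shown in the video: the function name `char_similarity` is hypothetical, and it uses the standard library's `difflib.SequenceMatcher` ratio as a stand-in for a formal character error rate.

```python
from difflib import SequenceMatcher


def char_similarity(reference, ocr_text):
    """Return a 0..1 similarity score between a reference and OCR output.

    This is difflib's ratio (2 * matching characters / total characters),
    a rough proxy for OCR accuracy rather than a formal error rate.
    """
    return SequenceMatcher(None, reference, ocr_text).ratio()


if __name__ == "__main__":
    # Made-up strings that mimic the I-rendered-as-J error from the video.
    reference = "Iohannes"
    ocr_text = "Johannes"
    print(f"similarity: {char_similarity(reference, ocr_text):.3f}")
```

On a real page the reference would be a manually transcribed sample of the same text, and a score below whatever threshold you are willing to accept would suggest going back to the pre-processing steps.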

Info

Channel: Python Tutorials for Digital Humanities

Views: 44,805

Keywords: python, digital humanities, python for DH, dh, python tutorial, tutorial, python and the humanities, python for the digital humanities, digital history, Python and libraries, python tutorials, python tutorials for digital humanities, pytesseract tutorial, pytesseract and python, how to use pytesseract, pytesseract, ocr in python, python ocr, python and ocr, ocr and python, pytesseract and ocr

Id: 4uWp6dS6_G4

Length: 6min 18sec (378 seconds)

Published: Fri Apr 09 2021