How to OCR an Index in Python with PyTesseract (OCR in Python Tutorials 03.01)

Video Statistics and Information

Captions

[Music] Hello and welcome back to the series on OCR in Python using PyTesseract. In this video, and really the next video, we're going to be tackling a single OCR problem: how to OCR a tri-column page, or an index. We're going to be working with Latin. If you don't know Latin, don't worry at all; this is just going to be a case study, and we can do this with any language.

What we're going to do in this video is try to use PyTesseract and Pillow to automatically OCR this entire page. What we're going to see by the end of this video is that just OCRing it is not going to work very well. We're going to need pre-processing methods that are a little more complex than what we've seen in the past. Specifically, we need to extract the sections that we want to OCR, these columns, so that we get better OCR results, and OCR results that are in sequential order. I'm going to go over how to achieve that in the next video, which follows this one immediately. For this video, though, we need to learn the basics: how to open up the file and how to actually OCR it, like we've seen in the past. Most importantly, we need to understand why these methods won't work, in order to understand why we need more robust methods to solve this problem.

So let's go ahead and jump right in. The first thing we need to do is import pytesseract, like we've seen in previous videos. The next thing we need to do is import Pillow, which we're going to be using to solve this problem; we're not going to be using OpenCV in this video. Specifically, from Pillow, which is imported as PIL, we need to import Image, with a capital I. Let's go ahead and execute that cell. Fantastic.

The next thing we need to do is identify the file location. We're going to create an object called image_file, and I've got this stored as index_02.jpg. That's going to be our actual file.

Now the next thing we need to do is open up that file, so let's go ahead and do that. To do that, we're going to say image = Image.open(image_file), with a capital I on Image. If that runs correctly, then you've done everything right so far.

The next thing we need to do is pass that image into PyTesseract. We're going to create an object called ocr_result and make it equal to pytesseract.image_to_string, and we're going to pass in one argument: our image. Once we've done that, PyTesseract is running and it's OCRing that image for us. Once those results are done, we're going to be able to print off ocr_result. Let's see what that looks like.

It looks pretty good. We're getting the whole page, though, and this is one of the limitations with PyTesseract: it's trying to understand this whole page as a single column of text, because we haven't given it any specific instructions. Whenever you can take a multi-column page and reduce it to individual columns that can be individually OCR'd, you're always going to have better results.

Nevertheless, this is not too shabby. Were I interested in taking this OCR to generate a list of named entities, or potential named entities, which this is (it's an index of names, which is what index nominum means in Latin), then I would need to do some other things. That's going to be our goal over the course of this video and the next video: to generate an output that looks something like this, a list of named entities that occur in a specific text. This is going to be very useful for making NER models, which are named entity recognition models, which I've dealt with a whole bunch in this series and which I've written a textbook on.

So let's go ahead and start thinking about ways that we can take this output right now and start doing that. One of the things I could do is split this on all the line breaks, and then, on each line, I could separate everything out on the spaces. After that I could start making some deductions: if the first letter of an item on the page is a capital letter, then maybe that's where I want to start. But I'd have to do a lot of post-processing.

One of the things that I can do, however, using PyTesseract and OpenCV, that Pillow cannot do, is use computer vision to automatically identify columns in this image and then go through and OCR each individual column. What you're going to find is that by doing this you get better OCR results. The OCR results will be in left-to-right order if we sort the contours, which I'm going to cover in the next video. The other thing that's going to be better is that we'll have an easier time generating rules that work consistently across an entire index, not just one page.

As always, I try to keep all of my content on this channel free to the public. If you enjoy this channel and you're getting a lot out of it, please consider contributing via Patreon. And as always, thank you to my Patreon supporters.
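
The steps described in the captions can be sketched roughly as follows. This is a minimal sketch, not the code shown in the video: the function names ocr_page and candidate_names, the punctuation stripped from each token, and the sample text are all assumptions made for illustration. It assumes the pytesseract and Pillow packages and the Tesseract binary are installed; the only library calls used are Image.open and pytesseract.image_to_string.

```python
def ocr_page(image_path):
    """OCR a whole page image with PyTesseract's default settings."""
    # Imported inside the function so that candidate_names() below can be
    # tried out even on a machine without the OCR stack installed.
    import pytesseract
    from PIL import Image

    image = Image.open(image_path)
    # With no extra instructions, the page is read as a whole, so text from
    # neighbouring columns of an index can end up interleaved.
    return pytesseract.image_to_string(image)


def candidate_names(ocr_text):
    """Naive post-processing: keep tokens that start with a capital letter.

    Hypothetical heuristic following the idea in the captions: split on
    line breaks, split each line on whitespace, then test the first letter.
    """
    candidates = []
    for line in ocr_text.splitlines():
        for token in line.split():
            # Strip common surrounding punctuation (an assumed character set).
            word = token.strip(".,;:()[]")
            if word and word[0].isupper():
                candidates.append(word)
    return candidates


# Made-up sample text standing in for real OCR output.
sample = "Abbo abbas 12, 45\nAdalbertus episcopus 7\n\nalbinus, v. Alcuinus 33"
print(candidate_names(sample))  # ['Abbo', 'Adalbertus', 'Alcuinus']
```

On a real page, one would call ocr_page with the path to the scanned index and pass the result to candidate_names. Because the default OCR output treats the page as a whole, the candidates come back in whatever order the text was read rather than column by column, which is the limitation the captions point to.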

Info

Channel: Python Tutorials for Digital Humanities

Views: 15,569

Keywords: python, digital humanities, python for DH, dh, python tutorial, tutorial, python and the humanities, python for the digital humanities, digital history, Python and libraries, python tutorials, python tutorials for digital humanities, ocr in python, ocr an index in python, how to ocr in python, ocr with pytesseract, how to use pytesseract in python, image_to_string, pytesseract index, pytesseract latin

Id: DXYPXZH2eGE

Length: 5min 36sec (336 seconds)

Published: Sun May 30 2021