Introduction to OCR (OCR in Python Tutorials 01.01)

Captions

[Music]

Hello and welcome to this new series, which is long, long overdue. I've been promising it for about three months now, and I'm finally getting around to doing it because my text classification and topic modeling series is wrapping up. This series is going to focus on a very common problem, not just in the digital humanities but across all disciplines: how to work with OCR, or optical character recognition, which means converting an image that has text within it into raw text.

This is an essential step in a lot of different tasks. As humanists, we might look at an image, see text in it, and read it as text. This is what we do. A computer, however, does not see it as text. We have to convert that image into what are known as numerical arrays, which can then be parsed by an OCR system that converts the image into raw text. Why is this important? Because OCR allows images and PDFs that are not yet searchable to become searchable.

Most times, digital humanists will use off-the-shelf software such as Adobe to actually perform OCR. But if you're like me and you work with languages that are not English, or texts that are poorly formatted, or that have some mistakes in them, or that were typed in the early 1930s, you're going to find problems. This is because the Adobe OCR is not nearly as good as other free software that is out there, notably Tesseract from Google. What we're going to be using in this series is a few different libraries, which I'm going to go over in just a second, but the general purpose of this series is to teach you how to use Python to OCR any document in any language.

Now, it's important to note that this is not going to work for handwritten documents. That's a different problem that I'm going to address in the future, because the machine learning models that we use to solve handwriting problems are different from the machine learning models that we use to solve typed problems, at least right now in 2021.

So there is a fairly common workflow for
solving an OCR problem in Python. Now, this workflow is going to be adjusted a little bit depending on the type of document that you're working with and the quality of that document. But for the most part, a workflow is a system in which you essentially pass a document through a pipeline. Think of it that way: think of a pipe going down. A workflow will always be sequential, at least in this case.

In this workflow, you will open up an image, and to do that in Python you'll use a library called PIL, which these days is provided by the Pillow package. The way in which you work with Pillow will depend largely on your version, and I'm going to go over that in a later video when I talk about each of these libraries and how to install them, because they can be a little bit tricky sometimes. Don't let that scare you off. I'm going to show you the pitfalls of the installation process and how to overcome them, such as putting Tesseract in your PATH. All these problems will be addressed in the next video.

Right now, think about the problem conceptually. Remember, it's always good to think about a programming problem as a concept first: think about how you would solve it, and then start implementing those little solutions to tackle it bit by bit.

So once you have an image open in Python, then comes OpenCV. OpenCV allows you to manipulate an image. When you're trying to OCR something, you are not going to use the standard off-the-shelf image. You are going to convert it, you're going to manipulate it, you're going to extract bits of it, and in a lot of cases you're going to do things like binarize it, converting it to black and white. You might do grayscale. I'm going to cover all of that throughout this series, and why you would do certain things to certain images. Essentially, this allows the OCR model from Tesseract, the machine learning model, to be much more accurate, because it's working with less data: it's not working with color, it's working with a binary black and
white image. We're going to go over all of that in a few videos, actually, because manipulating the image correctly is probably one of the more difficult parts of this whole process.

Finally, once the image is manipulated into the correct format for the machine learning model, you pass it to the machine learning model. Now, Tesseract is a little tricky. There are a few different parameters that you can pass to it, and those parameters are going to result in either an amazing output of OCR or a very bad output of OCR. I'm going to cover what the parameters are (there are about 14 or so) and when to use certain ones over others. I think there are about a hundred languages represented by Tesseract, both off the shelf and with custom things that you can download, such as the Latin OCR projects, early modern Latin. I'm going to cover all that in this series so that you can OCR a text in really any typed script; a lot of the scripts in use since the invention of the printing press are represented. There is even an early modern Greek OCR.

So that's what we're going to cover in this video series. But the way in which you approach the problem is a little different depending on not just the language but the state of your document. Think about all the different varieties of a typed document. We've got tables, we've got indices, we've got regular novel structure with one single body, and we've got text with footnotes where you don't want the footnotes. I'm going to address how to solve a lot of these larger common problems in the digital humanities, and when you might want to implement certain solutions over others.

This is going to be a long series, but it's important, because OCR is a complex problem that requires a broad knowledge of a lot of different libraries and a lot of different methods to solve. But once you have a command of them, and if you devote the time to it you will have that within a month or so of these videos, you will be able to solve a wide array of OCR problems relatively easily.
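
Editor's note: to make the grayscale and binarization step described above concrete, here is a minimal conceptual sketch. It is not the code used in the series. It assumes plain Python lists as a stand-in for an image array so that it runs without installing anything, and the function names, the sample pixel colors, and the threshold of 128 are all hypothetical choices made for illustration. In practice OpenCV provides these operations itself (for example `cv2.cvtColor` and `cv2.threshold`).

```python
# Conceptual sketch only: nested lists stand in for an image array.
# Nothing here calls Pillow, OpenCV, or Tesseract.


def to_grayscale(rgb_image):
    """Collapse each (R, G, B) pixel into one brightness value (0 to 255)."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_image
    ]


def binarize(gray_image, threshold=128):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [
        [255 if value >= threshold else 0 for value in row]
        for row in gray_image
    ]


# A tiny 2x3 "scan" with made-up colors: paper, ink, and one red mark.
PAPER = (230, 225, 210)
INK = (40, 40, 45)
RED_MARK = (200, 30, 30)

image = [
    [PAPER, INK, PAPER],
    [INK, RED_MARK, PAPER],
]

gray = to_grayscale(image)
binary = binarize(gray)

for row in binary:
    print(row)
# [255, 0, 255]
# [0, 0, 255]
```

Each pixel goes from three color values to a single value that is either 0 or 255, which is the "less data" the OCR model then works with. In this toy example the red pixel also lands on the black side of the threshold, so a simple threshold on its own does not separate colored marks from ink.
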
So a common workflow will look like this. Here we have an image of what is a table from World War II. We recognize this as a table, but for the computer system this is just an image. There is no text here. It's a series of pixels of varying degrees of white and gray, which we would call the surface or the paper; a series of pixels from lightish gray to dark black that represent the text; and a little bit of text that is represented with a watermark or a stamp that is in red.

A common way to solve this problem would be to first get rid of that watermark, which we'll go through in this series, then extract just the table, and then start parsing that table so that it can be processed by an OCR. To do all that, though, you need to not just open it up in Pillow; you need to manipulate the image, and you can manipulate it in Python using OpenCV. We'll see an example here of a bad OCR, because this watermark is not removed. In the series I'm going to show you how to remove things like watermarks so that you can get around bad OCR results.

If you notice, what OpenCV has allowed me to do is identify and extract individual rows within this table. That's how you solve a table-based problem: you need to extract individual rows, and in some cases individual cells, and then reconstruct the table. Once you're able to extract individual rows, you can save each individual row temporarily as a temp file in a temp folder in your script directory, and then pass that temp file to Tesseract, which is able to take it and actually OCR it. What you're not seeing here is the binarization of that table in OpenCV.

As you can see, we've got some fairly good results with a couple of expected mistakes, such as these double quotation marks, which are coming from this. In this series I'm going to show you how to eliminate little things like that and actually get better OCR, which I've done. So this is
kind of what our result looks like. This is in German, obviously, so the kant is coming out correctly. This is all looking like good output for German, and it's even able to capture our dates relatively well: 1 6 45, one six forty-five, and we see thirty, ninety-nine, ninety-five, and we see that right here. So this is what we're able to do. And again, this is an early example. What I'm going to show you is how to get an initial output like this and then an improved output by fine-tuning some of the parameters in OpenCV.

So overall, the objective of this course is to show you all these things, and we're going to try to tackle a few different problems in a few different languages in this series.

Problem number one is a traditional one. You've got a text that might be a critical edition or something. This is going to be in Latin. It doesn't matter; I'm picking Latin only because it's a challenging language to work with in NLP and a challenging language to work with when it comes to OCR. It's also something that my audience wants to see, so I'm including it here, and I always try to be multilingual with my videos so that people from all areas of the digital humanities can benefit.

What we're going to try to do, and spoiler alert, we will do this, is extract this text. But for the purpose of our video, we are only going to extract the sections of each page that have the body text here and the body text here. In other words, we are going to try to reject all of this marginalia, the stuff that occurs in the margins, and we're going to try to reject all the footnotes. What you're going to find is that every OCR problem is a little bit different. We're going to use some tricks, some things that are consistent in the image, to help us, but for the most part we're going to be trying to work with what
are known as bounding boxes, to capture blocks of text and eliminate small blocks of text. That's coming up in the series.

We're also going to be working with index data, which was another request. A lot of times at the end of books you'll have a long list of names in multiple columns. These multiple columns need to be solved in a very particular way in Tesseract, and I'm going to show you how to do that. By being able to OCR indices, you'll be able to generate names, which is important for another task that I've had a whole series on, a natural language processing task known as named entity recognition. This will allow you to quickly generate a list of named entities for a specific domain. Once you know how to do this, you can cultivate very large NER data sets very, very quickly.

The next thing we're going to do is work with tabular data. Now, there are a couple of different ways to solve tabular data in Python. There's Tabula, a very good library. I'm going to show you how to solve it with OpenCV and Tesseract, because you'll find that certain problems cannot be solved with Tabula, since they do not have well-structured, properly formatted tables. A lot of the time in the digital humanities we're working with primary sources that look like this: mid-century tables that were not designed in PDF format. That's why I'm tackling the problem with Tesseract in this series and not using Tabula. I'll have a whole other video series on how to extract nice tables.

So that's what we're going to solve: a few different text-based problems, from a linear single-column problem such as the MGH edition that we saw a second ago, to tabular problems, to multi-column problems. If you can solve all three of these, and you can solve them in multiple languages, then by the end of the series, mission accomplished. You'll be able to solve any problem that
comes your way. You will need to look things up and learn new things, because every OCR problem is a little bit different, but you will have all the tools necessary to solve any OCR problem that comes your way, if it is solvable with Python.

That's going to be it for this video. I hope you're looking forward to the series on OCR. I know I've been looking forward to making it for a very long time. So that's where we end. If you've enjoyed it, please like and subscribe down below, and if you're looking forward to the series, consider donating via Patreon, which is linked in the description down below.

Info

Channel: Python Tutorials for Digital Humanities

Views: 56,007

Keywords: python, digital humanities, python for DH, dh, python tutorial, tutorial, python and the humanities, python for the digital humanities, digital history, Python and libraries, python tutorials, python tutorials for digital humanities, intro to ocr, how to ocr in python, introduction to python and ocr, how to ocr, python and ocr, ocr in python, ocr and python, ocr with tesseract, pytesseract, python pytesseract, python and pytesseract, python and pytesseract for ocr, ocring

Id: tQGgGY8mTP0

Length: 12min 7sec (727 seconds)

Published: Tue Mar 30 2021