Training Tesseract 5 for a New Font

Video Statistics and Information

Captions
Hello there. So you've decided to train Tesseract with your custom font so it recognizes things a little bit better? Then this is the video for you, so let me jump right into it.

Let's first talk about how you train Tesseract at all. I'm going to show a little bit of the documentation and fill in the gaps. I think the most important thing is how you provide the ground truth, that is, how you tell Tesseract what you consider correct. For a custom font, what we want is to generate images rendered with that font, and alongside each image (sharing the same base file name) a text file describing exactly what is written on the image, plus a box file that describes, for each character on the image, where it is located and which character it is. With that ground truth, Tesseract can go and train itself on the new images. That's the first thing you'll need to figure out.

You can see here in the documentation that you need a .tif or .png file (we're going to use .tif) for the image I just mentioned, and a .gt.txt file for the ground truth, which is whatever is written on the image; I'll show later exactly what that looks like. We also have to provide something called the box file. In my experience, Tesseract's automatic generation of box files can be a little finicky, and since we're generating the ground truth ourselves from a custom font, we're going to produce the box file, the text file and the image file all together. Let me show you what that looks like.

The first thing I had to do was figure out a way to generate those images. It turns out you use the text2image application, which comes with the Tesseract training tools (watch my video in the description if you missed that). After you install Tesseract with its training tools, text2image will be on your path. The problem with text2image is that it generates images the way Tesseract 4 expected them, but Tesseract 5 needs something called line images, which are nothing more than images containing a single line of text instead of a full page.

So the first thing I needed was training text, a body of text from somewhere on the web. In this case I used the training text from the langdata_lstm repository: if you go to the English folder you'll find eng.training_text, which is just a big file full of English text. The problem is that text2image takes all of that text and generates an insane number of pages and pages, and we don't want that; remember, Tesseract 5 wants line images. So I wrote a quick script that takes this training text file and splits it into a bunch of files with only one line each (I'll show it in a second). Once you have those files, you can run text2image on each of them and it will generate the box file and the image for you.

The first thing I did was create a langdata folder, which you can see at the top left here; it just holds those files, and we're really only going to use the training text from it. Then I wrote this Python script, which will also be available in a repository linked in the description.
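Here is a minimal sketch of what such a split script can look like. The paths, the font name, and the eng_<n> file naming are placeholders you would adapt to your own layout:

```python
import os
import random
import subprocess

TRAINING_TEXT = 'langdata/eng.training_text'    # big text file from langdata_lstm
OUTPUT_DIR = 'tesstrain/data/apex-ground-truth'
FONT = 'Apex Legends'                           # placeholder: whatever name your font registers under
MAX_LINES = 100                                 # raise or remove this limit to generate more ground truth

os.makedirs(OUTPUT_DIR, exist_ok=True)

with open(TRAINING_TEXT, 'r') as f:
    lines = [line.strip() for line in f if line.strip()]

random.shuffle(lines)

for i, line in enumerate(lines[:MAX_LINES]):
    base = os.path.join(OUTPUT_DIR, f'eng_{i}')

    # the .gt.txt ground-truth file holds exactly the text rendered on the image
    with open(f'{base}.gt.txt', 'w') as gt:
        gt.write(line + '\n')

    # text2image renders the line in our font and emits eng_<i>.tif and eng_<i>.box
    subprocess.run([
        'text2image',
        f'--font={FONT}',
        f'--text={base}.gt.txt',
        f'--outputbase={base}',
        '--max_pages=1',
        '--strip_unrenderable_words',
        '--leading=32',
        '--xsize=3600',
        '--ysize=480',          # a single line, not a full page
        '--char_spacing=1.0',
        '--exposure=0',
        '--unicharset_file=langdata/eng.unicharset',
    ], check=True)
```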
You can just run it and it does exactly what I said: it takes this big text file, writes each line into a separate text file, and then calls text2image on that new file to generate everything for us. Let me explain a few things. We use the langdata unicharset when generating the new images; the unicharset basically encodes the rules that help the neural network figure out how words are formed in a given language, English in this case, and you can use whichever one you want, there are plenty. The arguments I pass to text2image mostly come from what Tesseract 4's tesstrain.sh used; the only things I changed were the ysize, because instead of a full page we now want just a small line, the character spacing, so it's not too tight, and of course a single page. I haven't mentioned it yet, but I'm going to use the Apex Legends font; I just downloaded the .otf file from the first Google search result I found, and that's what I'm going to try.

OK, so let me show you. I'll walk through exactly how my whole folder structure is laid out and how all of this works in a second, but for now let me just run the script so you get an idea of what's going on. First I'm going to create the data folder and the ground-truth folder. The best documentation for training I've found is the README of the tesstrain repository: instead of high-level ideas, it actually tells you exactly what to do to get training going. One thing it says is that you have to create a folder called data, and inside it a folder named after the model you're going to train, with a -ground-truth suffix; in my case that's apex-ground-truth, so that's what I created here. Now if I run split_training_text.py, it does exactly what I mentioned: it takes the training text file, which is an insane amount of text, splits it into a bunch of small single-line text files, and generates an image and a box file for each one.

Let me show you what that looks like. This is Windows WSL with Ubuntu, by the way, so I'm going to copy those files over to my actual Windows machine and open them from my Downloads folder. This is what the files look like: an image containing a single line of text rendered in the font, a .gt.txt file containing exactly the same text, and, most importantly, a box file. The box file basically lists each character and its exact position on the image (and I imagine size, maybe rotation or scale, I don't know), so you can see the line spelled out vertically, one character per row with its coordinates, spaces included. All of this is generated by text2image; there is no Tesseract running here, no machine learning, nothing. This is just us producing ground truth, something we completely trust because it isn't generated by the OCR itself, and that's what we'll use for training.

OK, cool, so let's begin training. Let me go into the tesstrain folder. I crafted this command, which I'll also put in the description of the video.
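As a rough sketch, the command looks something like this; it is run from inside the tesstrain checkout, and the ../tesseract/tessdata path assumes Tesseract was cloned next to tesstrain, so adjust it to your own layout:

```bash
# run from inside the tesstrain directory; adjust relative paths to your layout
TESSDATA_PREFIX=../tesseract/tessdata make training \
    MODEL_NAME=apex \
    START_MODEL=eng \
    TESSDATA=../tesseract/tessdata \
    MAX_ITERATIONS=100
```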
So, explaining a few things here. The command points at a tessdata folder inside the Tesseract clone, one directory up. What is that? Let me show you. I basically cloned Tesseract, and inside that repository there's a folder called tessdata that holds some default settings for the English network. This eng.traineddata file wasn't there originally, I put it there, but everything else was, including the configs folder, which is what we're actually after; it contains lstm.train and a bunch of other config files. So if you git clone Tesseract, this folder will be in there, but it's missing the eng.traineddata file, so we also have to get that. To get it, go to the tessdata_best repository; in there you'll find eng.traineddata, and you just take that file and drop it inside tessdata in your Tesseract clone. Again, this whole folder structure will be on GitHub, so don't worry too much about matching it exactly; just keep watching so you understand what I'm doing, and then you can lay out your own folders however you like.
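Put together as shell commands, that setup looks roughly like this; the URLs are the upstream tesseract-ocr repositories, and cloning everything side by side is just one possible arrangement:

```bash
# clone Tesseract itself; its tessdata/ folder already ships the training
# configs (lstm.train and friends) but not eng.traineddata
git clone https://github.com/tesseract-ocr/tesseract.git

# the "best" English model lives in the tessdata_best repository; copy it into
# the tessdata folder of the Tesseract clone (the repo is large, so downloading
# just eng.traineddata from the GitHub page works too)
git clone https://github.com/tesseract-ocr/tessdata_best.git
cp tessdata_best/eng.traineddata tesseract/tessdata/

# tesstrain provides the Makefile that drives the actual training
git clone https://github.com/tesseract-ocr/tesstrain.git
```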
Back to the command itself. The tessdata directory is where some of the needed data lives, like the English model. Then we run the Makefile; it's in the tesstrain repository, the one I said has the good README, so you'll definitely want to clone that as well (again, it will be on my GitHub, but clone it yourself if you're doing this from scratch). Inside tesstrain there's a Makefile that runs a bunch of commands, one of them being the training command, and it exposes a number of variables you can set. I set MODEL_NAME to apex; START_MODEL to eng, which means I'm training on top of the English model; then, for some reason, it also needs the tessdata folder again, the same path as before; and finally MAX_ITERATIONS, which I set to 100. That means we start from English and run 100 iterations.

That is very low; 100 iterations is really not a lot. You should definitely bump it to something that still finishes within a reasonable amount of time. You'll have to experiment, but something like ten or twenty thousand might be good. You also don't want to overfit, so don't overdo it; the best thing is to experiment, compare the results, and see what works. Let me actually do a bit more and run 400 iterations so you can see some progress. You start out at around a 66% character error rate, which is far from optimal, so you can see 100 iterations really isn't great. I bumped it to 400 to see if that improves anything, and now it's at around 53%. That's what I mean: if you let your computer run this for hours, maybe a couple of days, you can probably get a very low error rate; just be careful not to overfit the network.

One other thing you can do, instead of generating only a hundred files, is generate far more line files for the training data. You can tune the script I wrote: the full training text has a total of about 193,000 lines, and I only used a hundred because we have to finish within the video, but if you remove that limitation (just comment out those few lines in my script), it will generate 193,000 images for your network to train on, which might give you better results. You can see the error rate dropping roughly 10 points with each of these steps, so if I kept running iterations it would end up with a really good result.

Now let's actually evaluate it. I'm going to run the trained model on eng_1.tif, one of the line images we generated, and print the result to standard output. I point --tessdata-dir at the data folder we just created, which now contains apex.traineddata, our freshly trained model; I tell Tesseract the file is a single line of text, which we know it is; and I set the language to apex, the model we just created.
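As a rough sketch, the evaluation call looks like this; eng_1.tif is just one of the generated line images, --psm 7 means "treat the image as a single text line", --loglevel ALL turns on verbose logging in Tesseract 5, and the relative paths are from my setup and may differ in yours:

```bash
# run the freshly trained model on one of the generated line images;
# paths are relative to the tesstrain checkout
tesseract data/apex-ground-truth/eng_1.tif stdout \
    --tessdata-dir data \
    --psm 7 \
    -l apex \
    --loglevel ALL
```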
And you can see it works pretty well. So that's pretty much it. I'll put this repository up at the URL so you can just download it and start messing with the script. One of the reasons the script takes no arguments is that I want you to change it: I want you to go in and change the output directory, the name of the model, the name of the font. I didn't want to create a tool that someone calls with a bunch of arguments and, when it fails, has no idea why. I want you to understand what's going on here, because then you can do much more than just train your custom font: you can use ground truth that isn't generated by text2image at all, ground truth produced by yourself, by a human. There are many, many options.

OK, so let me recap how it all works. I think I explained the box file well: you take the image, the ground-truth text file with whatever is written on the image, and the box file, which says for each character where it sits on the image and which character it represents. That's basically how you train Tesseract, period, not just for a custom font but for anything. The trick is how you obtain the ground truth. In our case, since it's just a custom font, we can generate it automatically, robotically, and train on a massive amount of data. But say you want to train on handwritten data: you can't generate handwriting automatically, so you have to scan things, write the ground-truth text files yourself, and also produce the box files. There are tools for that; if you Google "box file generator tesseract" you should find applications that can do it for you. That's pretty much the gist of it.

Let me see what else I can explain. There's tesstrain; there's the Tesseract clone with its tessdata folder, into which you put eng.traineddata, and that's basically all you do there; and the langdata you get from the English folder of the langdata_lstm repository. So I think that wraps it up.

If you want to train a font other than the one I used: you can see I specify the font by name with the --font argument, and all you do is point that at the font you want. How do you install the font? In my case it was something like dropping the .otf file I downloaded from the web into ~/.local/share/fonts and then running fc-cache -fv; that forces Ubuntu to rebuild its font cache, so it finds Apex Legends and caches it, and from then on it's a recognized font. On Windows you just double-click the font, you'll see an Install button, click Install, and after installing it you can specify it here and use it for training.

You can tweak a bunch of other things here as well. You can see the images I generated are a little wider than necessary, so if you really want to optimize you could reduce the image size; there's a lot of white space on the right you could remove. There's also a lot of space below, but I couldn't remove that one: text2image was failing with anything smaller, so 480 worked and I left it at that. You could also change the exposure and the character spacing; the characters are quite spread out in my generation, so there's a lot of space between them, and you could tighten that up, or whatever you want.

That's pretty much the gist of it; there's no big secret. Feel free to download my repository and change it as you wish, and if you make an improvement, feel free to submit a pull request. I hope this was helpful. If you have any problems or questions, let me know in the comments below and I'll try to reply in a timely manner. Please subscribe if you want to see more content, leave a like, and Godspeed.
Info
Channel: Gabriel Garcia
Views: 30,549
Keywords: lstm, tesseract, custom font, new font, ocr, train, tune, fine tune, tesserac 5, tesseract 5 ocr
Id: KE4xEzFGSU8
Length: 17min 23sec (1043 seconds)
Published: Mon Sep 26 2022