Extract Text and Data from Any Document with No Prior ML Experience - AWS Online Tech Talks

Captions
Hi, my name is Kashif Imran and I'm a Senior Solutions Architect in AWS AI. I work with a large number of AWS customers who are looking to take advantage of machine learning to advance their cloud journey to the next level. Today's webinar is about extracting text and data from any document with no prior machine learning experience.

Let's quickly look at the agenda. We will start with some context about document processing: why documents are still important today, how they are processed, and what kinds of challenges organizations face when processing them. We will then look at Amazon Textract and how it makes processing documents at scale, with high accuracy, easy through simple APIs. We will have an overview of the product, look at the core features, go through the APIs, and explore them with some demos. We will then talk about some use cases and reference architectures, and wrap up with information on how you can get started with the service.

Why are documents still important today? As we talk to customers, they tell us that one of the problems they want us to solve from a computer vision standpoint is the ability to process documents at scale, because documents are still the primary tool for record keeping, communication, and collaboration across a wide variety of businesses, whether it's finance, insurance, legal, medical, and so on. A lot of information is locked up in unstructured content, whether it's paper or scanned versions of paper.

Let's look at an example of a document. Here is a typical mortgage application form that many of you have filled out when purchasing a home. In 2016 alone, 16.3 million of these applications were processed, generating over two trillion dollars of business. Here is another example many of us are very familiar with: taxes. In 2018 alone, the IRS expected over 240 million W-2 tax forms to be processed.

Why do we need to process these documents? If you have a digitized version of a document, search and discovery become very easy: you can quickly run a query and get the relevant documents back, instead of digging through a huge stack of unstructured content that isn't indexed or searchable. Compliance and control is another very important aspect; in enterprises and many regulated industries, having digitized versions of documents with the text extracted makes it much easier to comply with the rules. A third aspect is business process automation: as documents come in, the text and data in them, much of it unstructured, may need to kick off other business workflows, and that can easily be automated if you have digitized versions of these documents instead of just paper or scans.

How do people process these documents today? Some organizations simply use humans and do manual processing, and I'll talk about the challenges there in a moment. Others use traditional optical character recognition (OCR) to get some text out of documents.
And some have layered rules and a template-based approach on top of that OCR output to extract metadata from their documents. Let's look at the challenges you usually run into, especially when you try to automate this at scale.

Manual processing can be expensive, error prone, and time consuming. Here is one example to give you an idea: take this form and look at this one field. If you give it to four different people, you might get four different answers about which of the options on it, Exempt and the other codes, is actually selected. When humans process these documents manually, you can get variable output and inconsistent results, and you may need multiple human reviews to reach consensus.

Similarly, a typical OCR-based approach struggles with a wide variety of documents. Take documents with multiple columns: here is a very simple document with two columns, and if you extract it with a traditional OCR solution, text from the first column gets appended to text from the second, which gives you something that isn't very meaningful. Traditional OCR solutions usually don't do multi-column detection, and if you have rotated text or stylized fonts, accuracy drops and the text may not be detected at all.

Here is another example. A lot of documents contain tables, and if you process a table with a typical OCR solution you get a flat stream of text. Some people have figured out that they can use the spacing and do some templating on top of the OCR output to recover the table, but you can imagine how tricky that becomes when you have different types of documents containing different kinds of tables with different numbers of rows and columns. OCR reads from left to right, ignoring the table structure, which makes it very difficult to extract tables in a meaningful way and automate the process.

The other approach people use is rules and templates. These are limited by the accuracy of the underlying OCR, they usually require significant development and management overhead, you end up creating lots and lots of templates, and templates are brittle. Consider the well-known W-2 US tax form: all of these forms carry the same information, but even for the W-2 there are hundreds of variants each year, the same information just laid out differently. With a rule- and template-based approach you have to create templates for every one of those versions, and that becomes a nightmare pretty quickly if you are trying to automate a wide variety of forms.

To see why this is a challenging computer vision problem, look at some of these forms. As humans we can see that most of them look pretty similar, but if you look at the RGB channel information, for example the blue channel, computers see these forms very differently; they don't see much similarity between them at all, even though they look similar to the human eye, and that's what makes this a complicated and challenging problem.
This is where Amazon Textract comes in. We announced Amazon Textract at AWS re:Invent. Textract is a machine learning service, based on deep learning, that can easily extract text and data from virtually any document.

Let's look at the core features of Amazon Textract; we'll go through each of them in detail in a moment. Text extraction is like OCR++: it gives you the text, but as you will see it also gives you a lot more information about the geometry of that text, so you can process the document in a much more meaningful way. Textract extracts tables: you get information about rows and cells, which you can then use, for example, to do arithmetic on the values in the cells. Textract also extracts forms and gives you key-value pairs, so for a form you can easily pull out values like first name, last name, and so on. Most importantly, you don't need any machine learning experience to use any of these features. We provide a very simple API for each one, which we will explore over the next couple of minutes, along with SDKs in a variety of languages, so you can just call the APIs, get the results, and automate document processing without having to learn about machine learning.

Let's dig into each of these features, starting with text extraction. When you send a document to Textract, it returns the text with information about words, lines, and pages. You get a list of blocks back, where each block has a type such as WORD, LINE, or PAGE that you can use to process the document. The API you use to extract text is called DetectDocumentText. The document can be a blob of raw bytes that you send with the request, or a document sitting in Amazon S3. The response is a list of blocks, where each block has a type (PAGE, LINE, WORD, and so on), an ID, and relationships to other blocks within the document. The JSON view of the same request and response shows where you provide the raw bytes or the S3 object, and the response returns the list of blocks along with information about where each block sits in the document.

What this makes easier is processing documents with multiple columns. If you use Amazon Textract to get text from this two-column document, the output looks like this, in the appropriate reading order, and you don't have to maintain different templates to get text out of multi-column documents.

Let's look at a demo. I'll start with an example document: a simple document that doesn't have very good resolution, and that's exactly the point I want to make, that you can still get very good accuracy with Amazon Textract. Let's see how we can call DetectDocumentText to extract text from this document. I'm here in my terminal.
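The little program I run in this demo follows roughly this pattern. It's a minimal sketch using boto3, and the bucket and file names are just placeholders for the demo objects:

import boto3

# Minimal sketch: call Amazon Textract on an image stored in S3 (placeholder names).
textract = boto3.client("textract")

response = textract.detect_document_text(
    Document={"S3Object": {"Bucket": "my-demo-bucket", "Name": "simple-document.png"}}
)

# The response is a list of blocks (PAGE, LINE, WORD); print every detected line.
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])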
Let's run the command. It calls a simple Python program that sends this image to Amazon Textract, gets the result back, parses the blocks, and prints the output. You can see we were able to get the text from this image just by calling a simple API. Let me open the program so you can see how we're calling it. With a few lines of code we initialize a Textract client, call DetectDocumentText passing in the file name (in this case the document is already sitting in my S3 bucket), get a response back with blocks, and then iterate over those blocks and print them to the screen.

Now let's try another document, the two-column document from earlier, and see what kind of output we get when we call Textract with the same program. Even though it returns the blocks with their geometry information, we don't take advantage of that information, so we don't necessarily get the reading order right. Now let's run a version that does take advantage of the geometry. Putting the two side by side, you can see that we now extract the text in the proper reading order. Let me show you quickly how we do that: again it's a very simple app that initializes a Textract client, calls the DetectDocumentText API, and gets the response back; as we iterate over the blocks we look at each block's type and do some math using the geometry information Textract returns, including the bounding box of each block, and then print each line accordingly. By taking advantage of the additional information Amazon Textract gives you, beyond what traditional OCR provides, you can process documents with multiple columns and so on.

Now let's look at how you can combine the extracted text with other AWS services. In this example we call Amazon Comprehend and pass it the text we extracted to get sentiment analysis; later in the presentation I'll show you what a typical production architecture looks like, but here we'll just call Amazon Textract, get the text back, and then call Comprehend. When I run this, we get the sentiment back as neutral. How does it work? With only a few lines of code we initialize the Textract client and a Comprehend client, call DetectDocumentText, get the response back, concatenate the text from each block, send that text to Comprehend, and print the sentiment.

Let's look at one more example where we use Amazon Translate to get this text translated. When we run it, you see each line that we extracted using Amazon Textract, translated by Amazon Translate into a different language. It's just as easy to build: a very simple program that initializes the clients, calls DetectDocumentText to get the text back, and then iterates over each line, calls Amazon Translate for it, gets the result, and prints it.
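Put together, the Comprehend and Translate demos follow roughly this shape. This is a combined, minimal sketch rather than the exact demo code; the bucket and file names, the target language, and the truncation used to stay under Comprehend's input limit are illustrative choices:

import boto3

# Extract text with Textract, then run sentiment analysis and translation on it.
textract = boto3.client("textract")
comprehend = boto3.client("comprehend")
translate = boto3.client("translate")

response = textract.detect_document_text(
    Document={"S3Object": {"Bucket": "my-demo-bucket", "Name": "review-letter.png"}}
)
lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]

# Sentiment on the concatenated text (Comprehend accepts a limited amount of text per call).
text = " ".join(lines)[:4500]
sentiment = comprehend.detect_sentiment(Text=text, LanguageCode="en")
print("Sentiment:", sentiment["Sentiment"])

# Translate each extracted line (Spanish chosen here as an example target language).
for line in lines:
    result = translate.translate_text(
        Text=line, SourceLanguageCode="en", TargetLanguageCode="es"
    )
    print(result["TranslatedText"])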
I hope that helps you see how easy it is to use the Amazon Textract APIs to extract text and then take advantage of the additional information you get for each word, line, and page, including the bounding box and geometry information, to process different types of documents, including ones with multiple columns.

Now let's continue with the other features of Textract, starting with table extraction. Amazon Textract lets you extract tables from documents. When you send a document, the response gives you blocks of different types, in this case PAGE, TABLE, and CELL, and for each block you get the text along with a confidence score and the block relationships, for example the cells within a table. The API for this is AnalyzeDocument, and you simply pass TABLES as the feature type. Again, the document can be either a blob of raw bytes or a document sitting in Amazon S3, and the response is a list of blocks, each with an ID and relationships to the other blocks in the document. The JSON view is similar to the previous example: you send a request indicating whether your object is raw bytes or in Amazon S3, set the feature type, and get back a list of blocks with their types, which in the context of a table means things like cells and rows.

So how does Textract simplify table extraction? We saw earlier that traditional OCR just gives you rows of text. Amazon Textract recognizes the table, groups the words by cell, and gives you structured output, so you can easily see, for example, a column with start date and end date and the values underneath. That makes it easy to process tables, automate the process, and do it at scale.

Let's switch to a demo and see how table extraction works. In the console, I click Try Amazon Textract, and here you see an example document, a typical employment application form with a table at the bottom. The Raw text tab just shows you the traditional bag of words, all the text the service is seeing, but when we go to the Tables tab you can see that Amazon Textract has detected the table and shows all the columns and rows along with the words that belong to them.

Now let's switch to a program and see how you can use the API to build something like this. It processes the same form image you just saw in the console. When I run it, it calls the API, gets the results back, and generates output where you can see each row and column. Let me open it and show you how it works behind the scenes: the program takes the input file and calls Amazon Textract to get the results back; there is simple code to initialize the client, get the result, extract all the blocks, and then use the different block types, including the information about cells and their row and column indexes, to format those blocks and show you what the table looks like.
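The cell handling in that program looks roughly like this. It's a minimal sketch that assumes a single table on the page, and the bucket and file names are placeholders:

import boto3
from collections import defaultdict

# Ask Textract to analyze the document for tables.
textract = boto3.client("textract")
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "my-demo-bucket", "Name": "employment-application.png"}},
    FeatureTypes=["TABLES"],
)
blocks_by_id = {b["Id"]: b for b in response["Blocks"]}

def cell_text(cell):
    # A CELL block points at its WORD blocks through CHILD relationships.
    words = []
    for rel in cell.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)

# Group cells by row index and print the table row by row (single table assumed).
rows = defaultdict(dict)
for block in response["Blocks"]:
    if block["BlockType"] == "CELL":
        rows[block["RowIndex"]][block["ColumnIndex"]] = cell_text(block)

for row_index in sorted(rows):
    row = rows[row_index]
    print(" | ".join(row[col] for col in sorted(row)))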
Now let's look at another use case: maybe you want to extract the table and export it, say as a CSV file. If we run a different program, it generates output.csv; opening it next to the original document, you can see we captured the exact information, columns like start date, end date, and employee name, and were able to reconstruct the table.

Great, with that let's look at the next feature, form extraction. You send Amazon Textract a document in which it detects forms, and it returns the form field names as keys with their associated values, along with a confidence score, the page number, and the relationships between the different blocks. The API for form detection is the same AnalyzeDocument API; you simply pass FORMS as the feature type, and you get back a list of blocks, each with an ID, a type such as PAGE or a key-value type, and relationships to the other blocks on the page. In the JSON view you again send either raw bytes or an Amazon S3 object and set the feature type to FORMS, and you get back a list of blocks with the geometry information, such as bounding boxes, as well as the information about the form.

Amazon Textract makes it simple to process forms. If you have a form like this and send it to Textract, the logical grouping is captured, the relationships are captured, and the glyphs are captured, so it's a lot easier to process the information: you see fields like first name and last name and can easily use them in your business process workflows.

Let's look at a quick demo of form extraction, using the same employment application in the console. When we go to the Form tab, you can see it detects fields like the name, phone number, and home address, and because we give you the geometry information for each of these key-value pairs, it can draw a bounding box around every one of them. Let's also look at it from the API perspective. Here we have a simple form parser that reads the document, extracts the different key-value pairs, and prints every key-value it found; you can then ask it for a specific value, for example the value of Full Name in this document, or the value of Phone Number, and so on. Very simply, you can use the Amazon Textract API to process forms and extract those key-values, which can then be used to kick off other workflows, stored in a data store like Amazon DynamoDB for fast lookups, or stored in something like Elasticsearch to enable indexing and search use cases.
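The key-value logic in that parser looks roughly like this. It's a minimal sketch with placeholder bucket and file names, and it only concatenates WORD children, so checkboxes and other selection elements are ignored:

import boto3

# Ask Textract to analyze the document for form fields (key-value pairs).
textract = boto3.client("textract")
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "my-demo-bucket", "Name": "employment-application.png"}},
    FeatureTypes=["FORMS"],
)
blocks_by_id = {b["Id"]: b for b in response["Blocks"]}

def block_text(block):
    # Concatenate the WORD children of a KEY or VALUE block.
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            words += [blocks_by_id[i]["Text"] for i in rel["Ids"]
                      if blocks_by_id[i]["BlockType"] == "WORD"]
    return " ".join(words)

# KEY_VALUE_SET blocks tagged as KEY link to their VALUE block via a VALUE relationship.
form_fields = {}
for block in response["Blocks"]:
    if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
        key = block_text(block)
        value = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value = " ".join(block_text(blocks_by_id[i]) for i in rel["Ids"])
        form_fields[key] = value

# Print everything we found; a specific field could then be looked up by its key.
for key, value in form_fields.items():
    print(f"{key} -> {value}")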
Let's have a quick look at the parser itself. It's a relatively simple program: you initialize a client, get the results back using the AnalyzeDocument API, in this case passing FORMS as the feature type, and get the list of blocks back; then it's just a matter of processing those blocks based on their type, in this case the key-value set blocks. The rest of the program simply asks which field you want to search for, looks it up in the dictionary, and prints the values we extracted with Amazon Textract.

With that, let's move on. We've now looked at the different APIs: DetectDocumentText and AnalyzeDocument for text extraction, table extraction, and form extraction. Amazon Textract supports both synchronous and asynchronous APIs. The synchronous APIs are for low-latency use cases where you have a single-page document, for example a mobile capture: you call Amazon Textract directly, get the JSON back, and can process it right away, for example to pre-populate a form. When you have multi-page documents, maybe hundreds or thousands of pages, you use the asynchronous APIs: you call Amazon Textract to start a job that processes the document, and when the job is complete it notifies you, so you can fetch the results, process them, and store them in whichever data store you're using.

Here is how the asynchronous workflow works, whether you are detecting text or analyzing a document for form or table extraction. You start the job by calling either StartDocumentTextDetection or StartDocumentAnalysis. In the request JSON, the document must be in an S3 bucket, and you provide some additional information such as the SNS topic through which you want to be notified; what you get back is a job ID. To retrieve the results, you call GetDocumentTextDetection or GetDocumentAnalysis with that job ID, and you get back a list of blocks, again with the block types that depend on the API you used, all the geometry information such as bounding boxes, and the relationships to other blocks on the page. In addition to calling the APIs or using the AWS console, you can also use the AWS CLI to quickly call Amazon Textract and extract text; there is a CLI reference for each of these APIs, both the synchronous and asynchronous versions, so you can just type a command and get the document analyzed.
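In code, the asynchronous flow looks roughly like this. This minimal sketch simply polls for completion rather than wiring up the SNS notification channel, and the bucket and file names are placeholders:

import time
import boto3

textract = boto3.client("textract")

# Start an asynchronous text-detection job for a multi-page document in S3.
job = textract.start_document_text_detection(
    DocumentLocation={"S3Object": {"Bucket": "my-demo-bucket", "Name": "large-document.pdf"}}
)
job_id = job["JobId"]

# Poll until the job finishes (an SNS notification would replace this loop in production).
while True:
    result = textract.get_document_text_detection(JobId=job_id)
    if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(5)

# Page through the results with NextToken and print every detected line.
while result["JobStatus"] == "SUCCEEDED":
    for block in result["Blocks"]:
        if block["BlockType"] == "LINE":
            print(block["Text"])
    next_token = result.get("NextToken")
    if not next_token:
        break
    result = textract.get_document_text_detection(JobId=job_id, NextToken=next_token)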
Now let's look under the hood at how Amazon Textract is able to process documents where traditional OCR solutions don't necessarily give you high quality. Take orientation, which can be challenging for many other solutions: because Amazon Textract is a deep-learning-based service, it can detect different orientations and still give you results with high accuracy, without you having to worry about it. Similarly, structure variability: if you have different versions of a form where things change a little, extraction isn't going to break just because one of the fields moved by a few centimeters. Because Textract is template agnostic, it continues to work even when the structure of the documents changes, without you having to tell us anything about that structure. You can also have a wide variety of different documents, and thanks to the deep learning techniques it uses, Textract handles them without you having to define templates for each one.

Many documents are scanned or captured with phones and similar techniques, so you might get a photo taken at an angle or with distortion; Amazon Textract applies techniques like segmentation and rectification to make sure it can still detect the text with high accuracy. The same goes for documents like receipts or invoices, which are hard to capture with perfect geometry; again, segmentation and rectification help Textract give you accurate results.

Textract also understands the document structure and context to find tables: it detects a table when there are clear boundaries, and it can even understand cells when there are no explicit boundaries, using different deep learning techniques. Tables with variable-size rows or columns are handled as well. It detects phrases and groups of words, so for a form with a Full Name field containing a first, middle, and last name, Textract extracts the whole thing as a key-value pair for easy consumption. Going beyond traditional OCR, it infers the key-value association and detects the structure of a document without a template; as I mentioned earlier, there can be hundreds of variations of the W-2 form, and you don't have to define templates for any of them. In some forms the key sits above the value and in others it sits below, and Textract detects that automatically. And some values can be empty, for example an empty middle name; instead of you having to manually review those cases, Textract can infer the empty value and return null for the middle name, which makes it a lot easier to automate and process these documents at scale.

Now let's look at some reference architectures. Earlier I showed you examples where we ran Python code to call Textract and couple it with other AWS services. A typical workflow for indexing and searching a large number of documents looks like this: you bring the documents into an Amazon S3 bucket, which triggers an AWS Lambda function; Lambda calls Amazon Textract to get all the text back, including things like tables and forms, and based on the data you get back you store it in something like Amazon Elasticsearch Service, where it can be consumed using different search tools.
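Here is a minimal sketch of that Lambda function, assuming an S3 upload trigger and using the synchronous text-detection call for simplicity; the index_document helper is just a placeholder for however you write into Elasticsearch:

import boto3

textract = boto3.client("textract")

def index_document(key, text):
    # Placeholder: in a real pipeline this would write the document into an
    # Elasticsearch index (or another data store) for search and discovery.
    print(f"Indexing {key}: {len(text)} characters")

def lambda_handler(event, context):
    # Triggered by S3; each record describes one uploaded object.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        response = textract.detect_document_text(
            Document={"S3Object": {"Bucket": bucket, "Name": key}}
        )
        text = " ".join(
            b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"
        )
        index_document(key, text)

    return {"status": "done"}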
Similarly, there are NLP use cases where you want to do natural language processing, like the examples earlier in the presentation where we called Amazon Comprehend and Amazon Translate. In this architecture, documents land in S3, which again triggers a Lambda function; Lambda calls Textract to get the text out of the document and then calls language services such as Amazon Comprehend, Amazon Translate, or Amazon Comprehend Medical for medical analysis. All of the extracted and analyzed data can then be stored in services like Elasticsearch and consumed to discover insights, for example using Comprehend Medical on medical documents to improve patient care, and the same pattern works for a wide variety of other document types.

Another example is form capture using a mobile app. The end user takes a picture with their mobile phone, the app calls Amazon Textract and gets the results back, and the form is pre-populated on the end user's screen; the person can review, confirm, and submit it, and the data can be stored in a data store and processed according to the needs of the business workflow.

Some of the launch customers we had when we launched Amazon Textract were Alfresco and Cox Automotive, and you can find more details about their use cases on the Amazon Textract website.

To wrap up: with Amazon Textract you are able to extract data quickly and accurately, and it eliminates the need for manual processing, which lowers the cost of document processing. Most importantly from a development perspective, you do not need any machine learning experience; you just take advantage of very simple APIs, and the SDKs we provide make them even easier to call, so you can quickly build powerful solutions and automate document processing at a much higher scale. Amazon Textract has tiered pricing and, as you would expect from AWS services, a free tier as well, where you can process up to a thousand pages or images for text detection and up to a hundred pages or images for table or form detection. Amazon Textract is available in four regions: Ireland, Ohio, Virginia, and Oregon. To get started, here's the link where you can sign up for the preview or learn more about the service. Thank you.
Info
Channel: AWS Online Tech Talks
Views: 38,352
Keywords: Textract, OCR, Optical Character Recognition, Document detection, key-value pair extraction, Amazon Textract, Document Processing, AWS, Webinar, Cloud Computing, Amazon Web Services
Id: 5g48uf5sCu8
Length: 39min 48sec (2388 seconds)
Published: Thu Mar 21 2019