NEW GPT-4 Vision API: Best Way to Copy Text from Image (OCR in Python)

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
the amazing Vision capability of chat GPT to both analyze and describe images is now available through new API from open AI called gbt 4 vision and I want to see how well it worked on a real world business use case like extracting text Data from an image so let's start by reviewing a small python app I wrote that connects to the GPT 4 Vision API extracts data from an image and stores it in a structured file in this case a Json file then we'll run some tests to see how well this API performs and how accurate it is and we'll even push it a bit by trying to get it to process handwriting and then we'll go through some next steps and further improvements you can do as well as how you can combine the vision API with other AI apis to take some business processes to the next level it's pretty exciting stuff all right let's start by jumping into the code the GPT 4 Vision API is actually really easy to use and it works just like the other GPT apis from open AI the big difference is you have to pass in an image file and there's two ways to do that so the first is you can just take an image file from your local system so you can see this here so you take this file and then we actually encode it so you have to encode it to base 64 to be able to pass it to the API and so I have a method here that does that encoding the other option is you can just pass in a URL from the internet so you see this commented line here where we just have a a big link to the jpeg so you can swap this image URL variable to use either with this application and I'll put a link to the GitHub in the description of the video so you can check it out if you want next we just have to create a an open AI client and the best practice here is to use the open AI API key environment variable but if you want to be lazy and just put it in the code I've commented out a line here you can do that as well and then next we just create our chat completion and the model here will be gp-4 dvision D preview and then inside messages we set the r as user and we still want to put in a prompt to go along with the image to tell it what to do with it so what I'm going to do in here is say return Json object with data and say only return Json not other text because if you don't put this part in it'll respond with text saying this is here's a Json object and other text that's not useful this isn't actually the best way to do it there's a better way using functions I'll talk a bit about that later is the next step but for now this seems to actually work really well and then next we have to actually pass the image so all you have to do is just put in your url here so in this case this is going to be the B 64 encoded local jpeg we're passing in and that's it for our request and now the response is going to come back in this response variable we have to dig into there to get the actual message of it so you have to look inside response uh choices at index z. message. content and that's where the actual response from gp4 comes back in and an issue with not using a function and just getting the results back in the response it returns with markdown formatting so we just have a line here that simply gets rid of the the markdown and this seems to work fine and then from there we just convert it to a Json object and save it to a file in our data folder and we're going to use the same file name as the image that was originally passed so then you have kind of a pairing you have the actual image file and then the Json file and then simply print a message at the end saying we we've saved all the data let's just test it out and for this example I'm going to use this invoice I have here so if I open that up let's just see how it handles a nice simple clean invoice so we just run this there we go it took about 3 or 4 seconds and it says Json data save to skynova invoice. Json so let me open open up side by side the original image and then the Json file produced and from everything I'm checking it's got 100% accuracy it's got every character correct so what I'm really impressed with here is how it named these fields and how it's structured this file there's the heading here for business info which is on the top left here and so broke those down to business name business address business City it's intelligently structured this file to just match what's on this invoice so it's structured perfectly every names perfect really impressive stuff considering we didn't tell it anything about the image we were passing in we just asked for a Json file but let's bump up the complexity a bit cuz that was pretty easy file to read so I'm just going to swap out the image URL for the one that's just the online URL link and this file is handwritten which is always tough for the system to detect the characters it's also been skewed a bit I think it's been scanned so it's it's not as easy to read okay let's look at the split screen again and again it really got everything correct so it really recognized what were sections and what was data so for example here it knew that application information was kind of a section of the of the document so put that as one level above and then underneath that it had the the nice field names again full name phone number home address mailing address even some things that could confuse systems like for example the date here and the start date these slashes kind of look like ones actually so if it wasn't intelligent enough to know that it was a date it might have put just put a number there there but it identified it correctly and read it as a slash and again with the header row it detected the header row here so it didn't add that into here it just said previous employment history and then put the different rows as they came in the document a lot of intelligence here to make this another perfect document that represents this image super impressive and the reason I'm so impressed by this is I've been involved in a lot of Enterprise IT projects and they're big projects like million dooll projects and one of the big parts of it is the capture component of it where all you're doing is just taking in images and grabbing the data off it the fact that I can whip something up in Python you know it took me an hour or two and it's so accurate and so intelligent and a way extracts the data really wasn't possible a few months ago it really blows my mind how well it works so what's the next steps here one of them is I want to use functions instead of using that prompt engineering to bring back the Json object so make sure you subscribe if you're interested in that video and then the big next step is really around in gentic systems such as Auto gem I can see turn this app we built today into a data entry agent and then have that talk to other agents in the world workflow you got have an accounts payable agent that looked at different accounting systems figured out when to pay the invoice if it had to pay the invoice you have a fraud agent this could all be overseen by a manager agent which has quality assurance and make sure the process is running smoothly this is so much potential for Enterprise systems like that with AI now I hope you enjoyed the video I'll talk to you in the next one
Info
Channel: AI Unleashed
Views: 6,451
Rating: undefined out of 5
Keywords: OpenAI, GPT4V, GPT4 Vision, OCR, Data Extraction
Id: dhYumF7SQdA
Channel Id: undefined
Length: 5min 50sec (350 seconds)
Published: Wed Nov 15 2023
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.