OCR Using Microsoft's Phi-3 Vision Model on Free Google Colab

Video Statistics and Information

Captions
Welcome, fellow learners. Microsoft has recently released Phi-3 Vision, a state-of-the-art open-source multimodal model. Being multimodal, it can handle vision data along with text data. The model belongs to Microsoft's Phi-3 family and comes with a 128k context length. If you look at its Hugging Face page, along with latency and compute benefits it can be used for three different use cases: first, general image understanding; second, OCR; and third, chart and table understanding. In this video we are going to implement this model in the free Colab space so that everyone can follow along, and we will take OCR as our use case. So let's get started.

To implement the Phi-3 Vision model, let's first switch to the Colab notebook and write down the steps we are going to follow. In the first step, we install all the libraries the model requires. In the second step, we import the Phi-3 Vision model from Hugging Face using the Transformers library. In the third step, we import the processor required for this model, again via Hugging Face and Transformers. In the fourth step, we create the prompt we will pass to the model at inference time, and we also load the input image we are going to pass along with it. In the fifth and final step, we run inference: we convert the prompt and the input image into input tokens the model can consume, and then we analyze how it responds to the image and prompt we have given it. The use case we will look at in this notebook is fetching the OCR result from an invoice, so we will see how Phi-3 Vision performs at OCR on an invoice image.

Let's implement the first step of installing all the required libraries. Before installing them, we need to connect this notebook to a GPU server. As you can see, I have already selected a T4 GPU; if you haven't, go to Runtime, select Change runtime type, choose the T4 GPU, and click Save. Once you have done that, click Connect and you will be connected to a T4 GPU server. Now let's uncomment the install line; all the libraries to install are listed on the Hugging Face page of the Phi-3 Vision model if you scroll down. The libraries I have used here are numpy, Pillow, requests, torch, torchvision, and transformers, pinned to the exact versions given on the model page. The only one I have skipped is flash-attention, because flash-attention does not work with every kind of GPU, and in particular it does not work with this T4 GPU server.
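As a minimal sketch, the install cell could look like the following; the version pins are the ones shown on the model card around the time of the video and may have moved since, so treat them as assumptions:

```python
# Install cell for the Colab notebook. flash-attn is skipped on purpose,
# since it does not support the T4 GPU. Version pins follow the
# Phi-3-vision-128k-instruct model card (a snapshot in time).
!pip install numpy==1.24.4 Pillow==10.3.0 Requests==2.31.0
!pip install torch==2.3.0 torchvision==0.18.0 transformers==4.40.2

# accelerate reduces PyTorch boilerplate around device placement;
# restart the Colab session after the installs complete.
!pip install accelerate
```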
The main purpose of flash-attention in this model is to speed up both inference and training. As you may know, attention-based (Transformer) models have a bottleneck in self-attention, which has quadratic time and space complexity, and that is exactly where flash-attention helps. In our case we cannot use it because it does not work on a T4 GPU server, but that's okay: we are going to implement the Phi-3 Vision model without flash-attention.

Once the libraries mentioned on the Phi-3 Vision Hugging Face page are installed, Colab asks to restart the session, so let's restart it. We also need to install accelerate, which helps reduce PyTorch boilerplate, and after installing it we need to restart the session once more: click Restart session and it will restart this particular session. Once it has restarted, we can comment the install lines out and move on to the next part of the code.

In the next part I create a folder where I will save the Phi-3 Vision model. After creating it, you can open the file explorer and see the new folder, my_models, with a folder named phi3_vision under it; it starts out empty, and we are going to save our model into it.

Now that we are done installing the required libraries and creating the folder, we can import the Phi-3 Vision model. To do that we use AutoModelForCausalLM. There are basically two kinds of language modeling: causal language modeling and masked language modeling. Causal language modeling is used for autoregressive models like GPT-2 or GPT-3.5, where at each step the model can only see the tokens to its left, never the future tokens to its right. In masked language modeling, the model can see tokens on both sides and is asked to predict a few masked tokens. Since Phi-3 Vision is a causal language model, we use AutoModelForCausalLM. We also need the model ID, which you can copy from the Hugging Face page and paste directly into the notebook. With the model ID and the save folder in place, we call the from_pretrained function of AutoModelForCausalLM and pass the required arguments: the folder as the cache directory, device_map set to "cuda" since we are using a GPU, trust_remote_code=True so that it trusts and fetches the custom model code from Hugging Face, and torch_dtype="auto" so that the data type of the model weights is determined automatically.
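A minimal sketch of this step, assuming the folder path used in the video (the exact name may differ); as described next, this first attempt fails on a T4 because the remote modeling code checks for flash-attn:

```python
import os
from transformers import AutoModelForCausalLM

# Folder where the downloaded model files will be cached
# (path assumed from the video; adjust to taste).
cache_dir = "my_models/phi3_vision"
os.makedirs(cache_dir, exist_ok=True)

model_id = "microsoft/Phi-3-vision-128k-instruct"

# First attempt: on a T4 this raises an error asking you to
# `pip install flash_attn`, which is handled in the next step.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    cache_dir=cache_dir,
    device_map="cuda",       # place the weights on the GPU
    trust_remote_code=True,  # allow the custom Phi-3-V modeling code
    torch_dtype="auto",      # let Transformers pick the weight dtype
)
```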
So let's run this cell and see what it gives us. You can see there is an error here: it says this model file requires the following package that was not found in your environment, flash attention, and suggests running pip install flash-attn. But we cannot use flash-attention with this T4 kind of GPU. If you go to the Phi-3 Vision model page on Hugging Face, near the bottom they also provide a way to implement the model without flash-attention. First, we need to comment out four lines in the modeling_phi3_v.py file, and then we need to pass the attention implementation as "eager", which makes the model fall back to the standard, manually implemented attention. Since we pointed the download at our folder, the model files are already there (not the complete model, as it hit the error, but the modeling file is), so we can expand the folder, comment out the flash-attention lines given on that page, and save the file. Once we are done saving, we need to restart the session again: click Restart session, and after the restart we can re-run the cell that fetches the Phi-3 Vision model, this time also passing the argument _attn_implementation="eager" so that flash-attention is not needed.
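To recap the workaround in code, a sketch of the re-load after editing the cached modeling file (the cache path is assumed from the video):

```python
from transformers import AutoModelForCausalLM

# After commenting out the flash-attn import-check lines in the cached
# modeling_phi3_v.py and restarting the session, re-load the model with
# eager (manual) attention instead of flash-attn.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-vision-128k-instruct",
    cache_dir="my_models/phi3_vision",  # path assumed from the video
    device_map="cuda",
    trust_remote_code=True,
    torch_dtype="auto",
    _attn_implementation="eager",       # bypass flash-attn on T4 GPUs
)
```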
Now we have done both tasks needed to avoid flash-attention, so let's run the cell and see whether it works this time or gives us the error again. You can see it is now downloading the model, and once the download finishes it will import the model into our Jupyter notebook. The weights come in two safetensors shards — you can check their sizes in the Files section of the model page — one around 4.94 GB and the other around 3.35 GB; the model downloads both parts and then loads them into the notebook. Another thing you can do is open View resources and watch what your session is using: here the system RAM used is around 1.6 GB out of 12.7 GB total, the GPU RAM used is around 5 GB, and the disk used is around 37.8 GB, so you can monitor how much the model is consuming. And now you can see the model has been imported into our Jupyter notebook.

The next part is the processor. We import it because we need to convert the inputs — the prompt and the image we are going to pass to the model — into a format acceptable to this particular Phi-3 Vision model. We use AutoProcessor, again from Transformers, and once the processor is imported we are good to go: we can write our prompt and later convert it into input tokens. We can also print the docstring of the processor to see what it involves: it constructs a Phi-3-V processor which wraps a Phi-3-V image processor and a LLaMA tokenizer into a single processor, so for the text part it uses the LLaMA tokenizer and for the image part the Phi-3-V image processor.

Now that the imports are done, let's create our prompt. The Phi-3 Vision page specifies the kind of prompt we need to use: the image comes first in the prompt, then we write the instruction we want to pass, and finally comes the assistant marker, after which the model writes its answer. So I have set the role to user, put the image placeholder first, and then the instruction describing what I want the model to achieve: "Provide OCR for all the text in the given image, in markdown format", so that we can easily read the output and it is printed in a nicer way. While converting this plain text into a prompt I use the apply_chat_template function from the processor, passing two things: tokenize=False, because we are not tokenizing yet (we will do that later), and add_generation_prompt=True. What add_generation_prompt does, as we saw on the Phi-3 Vision Hugging Face page, is append the final assistant token to the prompt, marking that the assistant's turn has started and that the output will be written after it. If we run this and look at the prompt, we can see the assistant marker added at the end. We have also loaded the input image: to import it I use PIL, and since it is a web image I fetch it with the requests library. Let's run this cell and see which invoice image we are using.
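A minimal sketch of the processor, prompt, and image-loading steps; the invoice URL below is a placeholder, not the one used in the video:

```python
import requests
from PIL import Image
from transformers import AutoProcessor

# The processor wraps the Phi-3-V image processor and the LLaMA tokenizer.
processor = AutoProcessor.from_pretrained(
    "microsoft/Phi-3-vision-128k-instruct",
    trust_remote_code=True,
)

# Prompt format from the model card: the image placeholder comes first,
# followed by the instruction.
messages = [
    {
        "role": "user",
        "content": "<|image_1|>\nProvide OCR for all the text in the "
                   "given image, in markdown format.",
    }
]

# tokenize=False keeps this as a plain string for now;
# add_generation_prompt=True appends the assistant marker at the end.
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Fetch the invoice image from the web (placeholder URL).
url = "https://example.com/sample_invoice.png"
image = Image.open(requests.get(url, stream=True).raw)
```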
This is the image we are going to use for this use case: an invoice with fields like Bill To, Ship To, invoice number, invoice date, and PO number, along with values such as the subtotal and total. Now let's convert both inputs — the prompt and the image — into input tokens, the prompt through the LLaMA tokenizer and the image through the Phi-3-V image processor. If we run this cell we can see how the prompt and image were converted: there is the image part, and at the start you can see the tensor of input tokens for the prompt; an attention mask is used here as well. Now we have our input IDs and input tokens to pass to the model.

Next, there are a few arguments we are going to pass to the model. The first is max_new_tokens; I have used only 500, but you can increase it to 1000 or so if you want more tokens generated. I am using temperature equal to 0 so that there is no randomness, and I am also setting do_sample to False. Let's run this cell, and we are done with the generation arguments.

The final step is inference. You can see I have passed everything from the previous steps — the generation arguments, the inputs, and the end-of-sentence token, which already exists in the tokenizer of this processor, so we pass it directly to the generate function. Let's run this cell and see how it performs and how much time the model takes on this T4 GPU to generate up to 500 tokens for the image and prompt we have given it. While it runs, let's talk about the next step: there I remove the input IDs, which are the prompt tokens I passed in together with the image tokens produced by the Phi-3-V processor. And as you can see, the generation finished in only 43 seconds on this small T4 GPU. After removing the input tokens, we use the processor to decode the remaining tokens back into text so that we can read the output.
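Putting the inference step together, a minimal sketch following the usage example on the model card:

```python
# Convert the prompt and image into model-ready tensors on the GPU.
inputs = processor(prompt, [image], return_tensors="pt").to("cuda")

generation_args = {
    "max_new_tokens": 500,  # raise towards 1000 for longer documents
    "temperature": 0.0,     # no randomness
    "do_sample": False,     # greedy decoding
}

generate_ids = model.generate(
    **inputs,
    eos_token_id=processor.tokenizer.eos_token_id,
    **generation_args,
)

# Drop the prompt tokens so only the newly generated answer remains,
# then decode the token IDs back into text.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(response)
```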
Now you can see the result, and it is nicely rendered in markdown, so we can directly compare it with the image. The line items that sit in the middle of the invoice — front and rear brake cables, labor (3 hours), a new set of pedal arms — have been converted into a table along with the subtotal. Let's compare a few values: the front and rear brake cables are $100 in the image, and in the output they are also $100, and the total matches as well. It has also fetched all the other details, like the Bill To, Ship To, invoice number, invoice date, PO number, and due date fields and their values as written in the image, and it has even returned "Payment is due within 15 days", which appears at the bottom of the invoice image.

Great — we have found a way to implement the Phi-3 Vision model without flash-attention, and it still gives us pretty good results, and we have implemented it in the free Google Colab space so that everyone can run it without compute-resource problems. We have come a long way in this video: we first installed all the required libraries as given on the Phi-3 Vision Hugging Face page, then imported the model, then imported the processor required for it, then created our prompt and loaded the input image, and in the final step we converted the prompt and image into a format suitable for the model — the prompt into input token IDs and the image through the Phi-3-V image processor — and used them to run inference. We found that it works really well for OCR, so the Phi-3 Vision model can be used for OCR purposes too. Thanks for making it to the end of this video. Goodbye, until next time.
Info
Channel: TheAILearner
Views: 2,553
Keywords: phi-3, phi-3-vision, multimodal, multimodel
Id: 60P-lILHcCA
Length: 21min 51sec (1311 seconds)
Published: Wed May 29 2024