PaliGemma by Google: Train Model on Custom Detection Dataset

Video Statistics and Information

Captions
Vision-language models are changing how we interact with AI. On this channel we have experimented with multimodal models before, but those were large foundational models with billions of parameters, like CogVLM, that can't be easily fine-tuned on regular hardware. Well, today I'm taking you on a journey, because I'm going to fine-tune my very first multimodal model: PaliGemma.

PaliGemma can perform multiple vision tasks: it can be used for image captioning, visual question answering, and OCR, but what makes it super interesting is its ability to perform object detection and instance segmentation. All you need to do is use the detect and segment keywords in your prompt. PaliGemma is a family of open-source vision-language models released by Google a few weeks ago. It integrates the SigLIP image encoder and the Gemma 2B text decoder, connected by a projection layer. SigLIP is a powerful model designed to understand both images and text; similar to CLIP, it features image and text encoders, and it can be used for zero-shot image classification, image similarity, clustering, and much more. Gemma is a family of lightweight open-source text-generation models from Google, specialized in tasks such as question answering and summarization. PaliGemma is pre-trained on image-text pairs, making it adaptable for fine-tuning. Google released a notebook demonstrating how to do it for image captioning, but I wanted to take it a step further. I spent probably dozens of hours trying to figure out how to modify that notebook to use it for custom object detection. Now let me show you how I fine-tuned my very first VLM.

As usual, I prepared a Google Colab, so you can open it in a separate tab and follow along. The link to the code is in the description below, but you can also find it in our Roboflow Notebooks repository: just scroll down, look for "Fine-tune PaliGemma on object detection dataset", and click the "Open in Colab" button. Before we start, we need to gain access to the PaliGemma model. This requires logging into Kaggle, or creating an account if you don't have one, and accepting the PaliGemma license; as you can see, I've already done this. Now we can download one of the pre-trained versions of PaliGemma. The versions differ in image resolution and the length of the text sequence they accept. Due to the limited resources of Google Colab, we'll fine-tune the smallest one: PaliGemma 3B PT-224.

The next step is to configure our API keys. To download the model we'll need our Kaggle username and Kaggle key; we can generate these in the Kaggle settings panel by clicking the "Create New Token" button, which downloads a JSON file containing all the necessary information. To download the dataset we need a Roboflow API key: we just click on the avatar in the upper right corner of the Roboflow UI and select "API Keys" from the drop-down; after clicking copy, our key is in the clipboard. Next we return to Google Colab and click the key icon in the left side panel to open the Secrets tab. Please follow the instructions in the notebook and set all required values; in the end, your side panel should look just like mine.

Now I'll run the nvidia-smi command to confirm that our Colab has GPU acceleration. To fine-tune PaliGemma we'll need at least an NVIDIA T4 with 16 GB of VRAM and 12 GB of RAM. You can verify this by clicking Runtime and selecting "Change runtime type" from the drop-down menu; in the popup you can choose the hardware you like and click Save. Now it's time to install some external dependencies. The first is supervision, a computer vision Swiss army knife that we will use, among other things, to visualize PaliGemma results and benchmark our fine-tuned model. The second is roboflow, which we will use to download our dataset from Roboflow Universe.

I decided to use a dataset of handwritten digits and math operations. It's one of the datasets included in Roboflow 100, a collection of 100 diverse datasets that you can use to benchmark zero-shot detection models (side note: we benchmarked PaliGemma on RF100; more on that later in the video). To download the dataset from Roboflow Universe, click "Download Dataset", select the desired format, and click Continue. This will generate a code snippet which you can simply copy and paste into Google Colab. Now simply run the cell to download the dataset.
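For reference, the credential setup and the generated download snippet usually look roughly like this minimal sketch. The secret names, the workspace and project identifiers, and the export format string below are placeholders, not values from the video; use the snippet that Roboflow Universe generates for your own dataset and whatever secret names the notebook asks for.

```python
import os

from google.colab import userdata  # Colab's Secrets API
from roboflow import Roboflow

# The Kaggle download later reads the standard Kaggle environment variables;
# the secret names used here are an assumption, so mirror the notebook's instructions.
os.environ["KAGGLE_USERNAME"] = userdata.get("KAGGLE_USERNAME")
os.environ["KAGGLE_KEY"] = userdata.get("KAGGLE_KEY")

# Download the dataset from Roboflow Universe (hypothetical identifiers).
rf = Roboflow(api_key=userdata.get("ROBOFLOW_API_KEY"))
project = rf.workspace("your-workspace").project("your-project")
dataset = project.version(1).download("paligemma")  # format identifier may differ
print(dataset.location)  # local folder containing the downloaded splits
```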
While we wait for the dataset to download, let me tell you a little bit more about the required data format. To fine-tune PaliGemma we'll need a dataset in JSONL format. It's similar to JSON, but each line is an independent JSON object. Each object contains three keys: image, prefix, and suffix. image holds the path to the associated image, prefix is the prompt that we are going to send to the model, and suffix is the expected response. It's simple but versatile, and it allows us to fine-tune the model for various downstream tasks. When used for image captioning, the prefix could be "describe the image" and the suffix would be a description like "a man with a dog". When used for VQA (visual question answering), the prefix could be "what breed of dog is it" and the suffix would be the answer, like "a beagle".

Object detection is where things get interesting. Do you remember the unusual result we got from pre-trained PaliGemma when we asked it to detect a dog? Similar to other tasks, we have a prefix and a suffix. For instance, the prefix is "detect dog", where the keyword detect is critical. The suffix has a specific format: four consecutive loc tokens followed by the desired class name. If there are multiple objects, another section appears, separated by a semicolon. We can also search for multiple classes by listing them in the prefix, once again separated by semicolons. Most of you have probably already guessed that the four loc tokens define the location of the bounding box, but the format itself is quite surprising. Imagine an image with a detection defined by the coordinates x1, y1, x2, and y2, where the image has height H and width W. The first thing we need to do is put the coordinates in the right order: y1, x1, y2, x2. We then normalize them, dividing the y values by H and the x values by W, and multiply them by 1024. Let's assume that in our case we got 300, 400, 500, and 600. We place each value inside a loc token, ensuring it is zero-padded to form a four-digit sequence, and finally we append the class name after the loc tokens. By the way, right now you can download any object detection dataset from Roboflow Universe in this exact format; go explore, the link is in the description.
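To make that arithmetic concrete, here is a small self-contained sketch of the encoding. The helper name and the exact rounding and clamping are my own choices and may differ slightly from what big_vision does internally; the example reproduces the 300, 400, 500, 600 case from above.

```python
def box_to_paligemma_suffix(boxes_xyxy, class_names, image_wh):
    """Encode (x1, y1, x2, y2) pixel boxes as a PaliGemma detection suffix.

    Each box becomes four <locDDDD> tokens in (y1, x1, y2, x2) order,
    normalized to the image size and scaled by 1024, followed by the
    class name; multiple detections are joined with ' ; '.
    """
    w, h = image_wh
    parts = []
    for (x1, y1, x2, y2), name in zip(boxes_xyxy, class_names):
        coords = [
            round(y1 / h * 1024),
            round(x1 / w * 1024),
            round(y2 / h * 1024),
            round(x2 / w * 1024),
        ]
        # keep values inside the <loc0000>..<loc1023> token range
        coords = [min(max(c, 0), 1023) for c in coords]
        tokens = "".join(f"<loc{c:04d}>" for c in coords)
        parts.append(f"{tokens} {name}")
    return " ; ".join(parts)


# Example: a 1024x1024 image with one 'dog' box at (x1, y1, x2, y2) = (400, 300, 600, 500)
print(box_to_paligemma_suffix([(400, 300, 600, 500)], ["dog"], (1024, 1024)))
# -> '<loc0300><loc0400><loc0500><loc0600> dog'
```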
Now we can use the Linux head command to print the first five lines of the train and validation subsets. As described, each line is a separate JSON object containing the keys image, prefix, and suffix. Let's select 25 samples from our training set and visualize them. We can see that each image contains only one object; this is quite important, and I will discuss it further at the end of the video when I cover the limitations of PaliGemma.

Most of the code you will see right now is pretty much the same as in the Google AI notebook I mentioned earlier; the changes I made relate to parsing and visualizing object detection results. The first step is to clone the big_vision repository, where Google released PaliGemma. The library is written in JAX, a deep learning framework and an alternative to TensorFlow and PyTorch. It offers automatic differentiation and just-in-time compilation, allowing for faster and more efficient model training. On our channel we have a video covering the basics of JAX; if you're interested, the link is in the top right corner. Once everything is installed, we import JAX, big_vision, and the other necessary libraries. We also check what and how many GPUs and TPUs we have; in our case, of course, we have a single NVIDIA T4 GPU.

To limit GPU memory consumption and enable fine-tuning in Google Colab, we'll use the smallest version of PaliGemma: 3B PT-224. Let's download the base model from Kaggle using the preconfigured API keys. Importantly, you must accept the PaliGemma license on the model card on Kaggle; without that, Kaggle will simply deny you access to the model. The download itself will take several minutes, so let's use the magic of cinema to speed it up. PaliGemma has 3 billion parameters: 2 billion for the Gemma language model and about 1 billion for the image encoder. Due to the limited resources available in Google Colab, we need to limit the number of trainable parameters and freeze the rest. We'll fine-tune only the attention layers of the language model. This will, of course, limit the trainability of the model, but at the same time it protects us from the out-of-memory exceptions we all hate. We load the model onto the GPU and print the layers that make it up: frozen layers have been loaded as float16 and those to be trained as float32.

Time to work on data loaders. First we'll prepare a few helper functions: a preprocess-image function that converts the input image to grayscale and resizes it to the size required by the model, in our case 224x224; a preprocess-tokens function that tokenizes and formats the input text, applying an attention mask to distinguish between prefix and suffix; and a postprocess-tokens function that decodes model output back into human-readable text. Next we define the train and validation data loaders utilizing the functions above. We also set the SEQLEN parameter, in our case 128, representing the total number of tokens processed by the model during training. To enable batching, all examples need to have the same length: the prefix and the suffix are joined, separated by a newline character; overly long examples are truncated, while short ones are padded with zeros. This parameter directly impacts memory usage: the higher it is, the more VRAM you will need. However, if your dataset has a lot of classes, and because of that longer prefixes, you will need to make it higher anyway. I made small modifications to these functions compared to the original Google AI notebook, particularly in the validation data loader, which was not returning the ground truth that we will need for benchmarking the model, as you will see shortly. Here are a few examples from our dataset; it's worth checking whether the text has been truncated due to overly long sequences. As we can see, loading and parsing work correctly.

Finally, we define the training and evaluation loops. The training loop includes a step function to calculate the loss per example and apply stochastic gradient descent to update the model's parameters. The evaluation loop includes a prediction function to make predictions on the validation dataset and assess the model's performance. Now that we are set up, let's fine-tune the model. The code runs the training loop for 64 steps, printing the learning rate and the loss for each step. Every eight steps, the model prints its predictions for a selected subset of the validation set. This lets us observe how the model's ability to detect objects improves over time.
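Conceptually, that training step follows the standard JAX recipe: a jitted function that computes a loss, takes its gradient, and applies an SGD update. The sketch below uses a toy parameter tree and a toy loss purely for illustration; it is not the actual big_vision code, which roughly computes a cross-entropy over the response tokens with the prefix masked out.

```python
import jax
import jax.numpy as jnp


@jax.jit
def sgd_step(params, batch, lr=0.03):
    """One training step: compute the loss, take its gradient, apply SGD."""

    def loss_fn(p):
        # Toy stand-in loss; the real notebook uses a masked cross-entropy
        # over the suffix tokens produced by the language model.
        logits = batch["tokens"] @ p["w"]
        return jnp.mean((logits - batch["targets"]) ** 2)

    loss, grads = jax.value_and_grad(loss_fn)(params)
    # In the actual notebook only the trainable (attention) parameters are
    # updated; frozen parameters keep their pretrained values.
    params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return params, loss


# Example usage with dummy shapes
params = {"w": jnp.zeros((8, 4))}
batch = {"tokens": jnp.ones((2, 8)), "targets": jnp.zeros((2, 4))}
params, loss = sgd_step(params, batch)
```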
In the early stages of training we can even expect predictions that don't meet the required structure: too many or too few loc tokens, no class name, or no semicolon between detections. As the training progresses, the model's predictions become steadily more accurate, and by step 64 the detections should closely resemble the annotations provided in the training data. This process takes several minutes on a T4 GPU. Of course, all those parameters are configurable, and you should adjust them depending on your training set.

Before we take a look at the training results, I want to invite you to our upcoming community session. If you don't know, after every YouTube video we organize a live stream where I answer your questions about the video, but also about computer vision in general. You can find the recordings of previous streams on our YouTube channel, and the date and time of the upcoming community session in the description below. Come by and say hi. See you there!

Here's a sample of the results PaliGemma achieved on images from the validation set. Out of 16 examples, the model made a mistake only once, confusing a five with an eight. The model's accuracy is also confirmed by the mean average precision, calculated to be 0.9, and by the confusion matrix where, as you can see, almost all values are on the diagonal, which means that in the vast majority of cases objects were detected and assigned the correct class.
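To compute those metrics, the model's text output first has to be parsed back into bounding boxes, which is the inverse of the encoding shown earlier. A minimal stand-alone decoder could look like the sketch below (the function name and regex are mine); the notebook itself relies on supervision for this conversion and for the mAP and confusion matrix numbers, so check the supervision documentation for its current helper APIs.

```python
import re

# One detection: four <locDDDD> tokens followed by a class name,
# with multiple detections separated by ';'.
DETECTION_PATTERN = re.compile(
    r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)"
)


def paligemma_suffix_to_boxes(text, image_wh):
    """Decode model output back into (x1, y1, x2, y2) pixel boxes and labels."""
    w, h = image_wh
    boxes, labels = [], []
    for ys1, xs1, ys2, xs2, name in DETECTION_PATTERN.findall(text):
        y1, x1, y2, x2 = (int(v) / 1024 for v in (ys1, xs1, ys2, xs2))
        boxes.append((x1 * w, y1 * h, x2 * w, y2 * h))
        labels.append(name.strip())
    return boxes, labels


boxes, labels = paligemma_suffix_to_boxes(
    "<loc0300><loc0400><loc0500><loc0600> dog", (1024, 1024)
)
print(boxes, labels)  # [(400.0, 300.0, 600.0, 500.0)] ['dog']
```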
Time to save our trained model and test it on new images. First we save the model weights to the hard drive, then deploy them to Roboflow Universe, selecting PaliGemma 3B PT-224 as the model type. Similar to pulling the weights, uploading them may take a while, but once the process is complete, your model version page should look like this. OK, let's put our freshly trained model to the test. I handwrote a few digits on a piece of paper, took pictures of them, and uploaded them to the Gradio app where I deployed the model. As a prompt, I used the same text that was in the prefix of my dataset. All three digits were detected correctly. Interestingly, when we repeat our initial test and ask it to detect a dog, the model is still able to do it. This is a clear advantage of PaliGemma over traditional CNN-based detectors: when we fine-tune the model on a custom dataset, it can still detect the object classes on which it was pre-trained. Unfortunately, when I upload an image containing multiple digits and mathematical operations, the model struggles. This isn't entirely surprising; after all, the original dataset did not contain a single sample where multiple symbols appear at the same time, and I suspect that even a state-of-the-art CNN would have problems detecting those objects without data augmentation.

But without further ado, let's talk about the problems I encountered while playing with PaliGemma over the past few weeks. First of all, I have to come clean: I did a bit of cherry-picking when it came to choosing the dataset. Originally I planned to use the poker cards dataset (by the way, also one of the RF100 datasets). However, despite spending considerable time and applying creative augmentation techniques, I couldn't achieve a mAP over 0.25. It seems that the issue is related to the number of detections in a single image, and I have a few theories about the cause. A large number of bounding boxes forces the model to generate longer sequences, in some cases longer than the allowed length defined by the SEQLEN variable. I also suspect that the order of bounding boxes matters, so even if the model returns the correct bounding boxes, but in a different order than the one defined in the dataset suffix, the model simply gets confused. Or maybe, in complicated cases like this, fine-tuning only the attention layers of the language model is not enough, or maybe we should even train a larger model. Either way, in the end I decided to show you how to fine-tune PaliGemma on this specific case, but I think you should be aware that this problem exists, and if any of you have an idea how to solve it, please let me know in the comments; I would be super grateful.

Many of you have also asked me if PaliGemma can be used to automatically label your dataset. In the past we've shown how to use zero-shot detectors like Grounding DINO or YOLO-World to automatically annotate data; we even added this functionality to the Roboflow editor. We tested PaliGemma on RF100 to evaluate its zero-shot capabilities, but they don't seem to be good enough, especially compared to models like Grounding DINO. For over 50% of the datasets, PaliGemma achieved a mAP equal to zero; interestingly, this includes the vehicle detection dataset. PaliGemma scored high on the Twitter posts, thermal dogs and people, coins, construction safety, and animals datasets, achieving over 50% mAP on each of them. This aligns with Google's statements about PaliGemma: it's a model that needs to be fine-tuned to achieve satisfactory results.

And that's all for today. I highly encourage you to try to fine-tune PaliGemma on your own data, and remember: all datasets on Roboflow Universe are right now accessible in JSONL format compatible with PaliGemma. If you have any problems or questions, let me know in the comments below, and I'll try to answer them during the upcoming community session, where we are going to talk about not only PaliGemma but also the latest YOLO model. Fine-tuning PaliGemma was another challenging yet incredibly rewarding project that I had the opportunity to do on this channel. I've wanted to go beyond classical CNN-based models for quite some time, and it's awesome to finally have the opportunity to do it. While PaliGemma is not perfect, and fine-tuning it on datasets with a large number of classes or a high number of bounding boxes per image is hard, it's a step in the right direction and opens the door to accessible, open-source, easily fine-tunable multimodal models that anybody can run on their own GPU. And who knows where we are going to be next year. In the meantime, if you enjoyed the video, make sure to like and subscribe, and stay tuned for more computer vision content coming to this channel soon. My name is Peter, and I'll see you next time. Bye!
Info
Channel: Roboflow
Views: 4,002
Keywords: PaliGemma, Object Detection, Vision-Language Model (VLM), Large Multimodal Model, JAX, JSONL, Google AI, Image Captioning, Visual Question Answering (VQA), Fine-tuning, Custom Dataset, PaliGemma tutorial, PaliGemma object detection, PaliGemma fine-tuning tutorial, RF100 dataset, Multimodal AI, SigLIP, Gemma, How to fine-tune PaliGemma, PaliGemma object detection tutorial, Deploying PaliGemma models
Id: OMBmVInx68M
Length: 21min 19sec (1279 seconds)
Published: Mon Jun 03 2024