Segment Anything Model (SAM) from Meta AI: model architecture, data engine, results and limitations

Captions
In natural language processing, there exist what are called foundation models. These models are trained for sequence prediction, where the model predicts the next word in a sentence. Such foundation models can then be used for any other NLP task, like translation or text summarization, using what is called zero-shot transfer learning. The most well-known way to achieve zero-shot transfer learning for a specific task is prompting, which is how we have been interacting with, say, ChatGPT. The main reason for the prevalence of such foundation models in NLP is that data is available at scale: text is everywhere on the web, and almost all of it can be used for sequence prediction training, as no labels are needed for sequence prediction.

When it comes to computer vision, even though we have billions of images on the web these days, these images are not labeled with bounding boxes or segmentation masks, and so establishing foundation models has been challenging. So can we address this very problem and introduce foundation models for computer vision, or more specifically for segmentation, so that we can do zero-shot learning for a different task using just prompting instead of retraining for the new task? The Segment Anything Model, or SAM, does just that and solves this very problem. So without further ado, let's get started.

Unlike language models, image models take images as input, so how can we prompt a segmentation model? The prompt can be a number of things, ranging from simple points on a canvas indicating where to segment in a given input image, to one or more bounding boxes, or even a rough drawing on a canvas indicating what to segment in the input image. Last but not least, it could literally be a text prompt explaining what to segment in the image. In any case, the model should be flexible enough to handle any of these inputs and output segmentation masks.

For this reason, the model architecture has an image encoder, which encodes the input image into a standard representation called embeddings. These days there are several neural networks available for embeddings; the authors' choice of encoder is a masked autoencoder (MAE) pre-trained Vision Transformer that can handle higher-resolution inputs. In order to encode the prompts, they use prompt encoders. If the input is dense, such as a rough mask of the object, they use convolutional operations; if the input prompt is sparse, such as points or bounding boxes, they use positional encodings; and lastly, if the input is a text prompt, they use CLIP embeddings. The image and mask embeddings are then fused together using element-wise summation and finally put through a decoder, as we now have to uplift these embeddings to the size of the image itself to arrive at a segmentation mask whose dimensions match the dimensions of the input. For the decoder, they have chosen a modified Transformer decoder block.

To train the setup, they use a linear combination of focal loss and dice loss. At the output, you can notice that there are three scores rather than a single score for segmentation. That is to eliminate ambiguity: for this example image, let's say you clicked a single point on the scissors handle as a prompt. The model doesn't know if you wish to segment the entire scissors or just the handle, so it makes sense to train for three levels of detail or granularity, and so we have three outputs, unlike normal segmentation where we only have one output. Putting all this information together is a colorful animation on the SAM model website, the link for which I've given in the description of this video.
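The video describes prompting only at a conceptual level. As a rough illustration, here is a minimal sketch using Meta's released segment-anything Python package (which the video does not walk through); the checkpoint filename, image path, and click coordinates below are placeholders. With multimask_output=True, the predictor returns the three candidate masks and their scores, matching the ambiguity discussion above.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (placeholder path) and wrap it in a predictor.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# The image embedding is computed once here; prompts can then be varied cheaply.
image = cv2.cvtColor(cv2.imread("scissors.jpg"), cv2.COLOR_BGR2RGB)  # placeholder image
predictor.set_image(image)

# A single foreground click (placeholder pixel coordinates, label 1 = foreground).
point = np.array([[320, 240]])
label = np.array([1])

# multimask_output=True returns three masks at different granularities plus scores.
masks, scores, _ = predictor.predict(
    point_coords=point,
    point_labels=label,
    multimask_output=True,
)
print(masks.shape, scores)  # (3, H, W) boolean masks and their predicted quality scores
```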
What's interesting is that they have made the decoder and the prompt encoder extremely lightweight: they take just 55 milliseconds in the web browser without using a GPU. Now that's amazing. The image encodings, however, are computed once per image on the server side using the encoder, and I'm guessing the embeddings are sent back to the browser and stored in the browser's database so that they can be reused any number of times with different prompts, in order to avoid computational overhead.

The training procedure is different from the standard training of a neural network, because the aim is a foundation model rather than just a segmentation model. Driven by the lack of abundant segmentation data on the internet, they built a data engine, which resulted in a huge dataset of 1.1 billion masks over 11 million images. This data generation system, or data engine, was developed in three stages.

In the first, assisted-manual stage, they trained the SAM model with commonly available public datasets for segmentation and let this model interact with manual annotators, who used the browser to correct the output masks by erasing and brushing the canvas. After gathering sufficient data, they retrained the SAM model with this new data, and this cycle of periodic retraining was continued six times to evolve the final model for the stage. At some point during this iteration, the encoder network size was also increased from ViT-B to a larger ViT-H model. This stage resulted in 120,000 images annotated with 4.3 million masks, and the average number of masks per image went up to 44 by the end of the stage.

The second, semi-automatic stage focuses on improving diversity. To improve diversity, the annotators were asked to label additional unlabeled objects that were much more detailed in the image, and so labeling took longer at this stage. By the end of this stage, they had labeled 180,000 images with 5.9 million masks. Similar to the first stage, they did periodic retraining, but only five times. The average number of masks per image increased from 44 to 72 by the end of this stage.

The last, fully automatic stage introduced prompting at the input with regular 32-by-32 grids of points, and the output at this stage would be part, sub-part, and whole object. To further refine the quality, they introduced zoomed-in image crops. At the end of this stage, they had 1.1 billion masks on 11 million images, leading to the SA-1B dataset.

One thing to note about the SA-1B dataset is that even though it has 1.1 billion masks, these masks are fully automatically generated by the SAM model. The next big point is that the dataset has very high-resolution images compared to, say, the COCO dataset. And lastly, as seen from the figure in the paper, the masks are quite evenly distributed across the image compared to previous datasets like COCO or Open Images.
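The fully automatic stage described above (a regular 32-by-32 grid of point prompts plus zoomed-in crops) is roughly what the released SamAutomaticMaskGenerator does. The sketch below is illustrative only: the checkpoint and image paths are placeholders, and the thresholds shown are the package defaults rather than values stated in the video.

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # placeholder path

# points_per_side=32 prompts the model with a regular 32x32 grid of points;
# crop_n_layers > 0 additionally runs prompts over zoomed-in image crops.
generator = SamAutomaticMaskGenerator(
    model=sam,
    points_per_side=32,
    pred_iou_thresh=0.88,        # package default, filters low-confidence masks
    stability_score_thresh=0.95, # package default, filters unstable masks
    crop_n_layers=1,
)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)  # placeholder image
masks = generator.generate(image)  # list of dicts: segmentation, bbox, predicted_iou, ...
print(len(masks), "masks generated")
```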
Now that we have the SAM model trained on the SA-1B dataset, it is readily available for zero-shot transfer learning on novel tasks. The main goal here is to use the model on a prediction task it has never been trained for, and to produce a valid mask for each of these tasks. Let's first see what the tasks are.

Single-point valid mask segmentation is an ill-posed task: you provide one point as an input prompt, and the model produces the full mask. Edge detection is when you input an image and expect the model to identify the edges in the image. Object proposals are candidates for object detection; in many object detection systems, these proposals are evaluated by the detection model to find out which proposal belongs exactly to an object. Instance segmentation is when you find each and every object of the same class in the image; in this example the class is person, and the instance segmentation model should identify that there are five persons. With that quick explanation of the tasks, let's see how SAM has performed on each of them.

To evaluate single-point mask segmentation, they collated 23 diverse segmentation datasets and compared the performance of a state-of-the-art algorithm called RITM against SAM. These are some sample images from each of the 23 datasets, and the reported results indicate that SAM performs better than RITM on 16 out of the 23 datasets, which is quite impressive.

When it comes to the low-level vision task of edge detection, they compared performance by transferring SAM to the BSDS500 dataset, which is a standard benchmark for edge detection. They produce results by evaluating a 16-by-16 grid of points as input, leading to 768 masks per image. The results in the figures indicate that the SAM model does not understand which edges to suppress and which edges to keep, mainly because this is a general-purpose foundation model and it is not biased by the dataset. Nevertheless, it seems to perform edge detection reasonably well, as shown in the figure from the paper.

Moving on to the mid-level vision task of generating object proposals, SAM was slightly modified to convert the output masks into proposal bounding boxes. They then use the LVIS dataset to evaluate this task and compare against the ViTDet object detector. The obvious metric for comparison here is average recall, and the results indicate that SAM outperforms on medium and large objects, and only underperforms when the objects are small and occur quite frequently in the image.

Instance segmentation can be considered a high-level task in the hierarchy of recognition. Making SAM do instance segmentation seems straightforward: you get object proposals from a proposal system like ViTDet or something similar and feed the bounding boxes as prompts to the SAM model, which results in instance segmentations at the output. Based on the evaluation on the LVIS dataset, SAM only slightly underperforms compared to purpose-built instance segmentation models.

An even higher-level, ambitious task is to input text prompts and produce masks at the output. They mention themselves that this idea is a proof of concept. To adapt SAM for this problem, you first get the image embeddings from a CLIP model and use those as prompts to the SAM model during training. During inference, because text and images are aligned in the CLIP model, we can simply input the text to the CLIP model instead and get the segmented images as output. As a result, the SAM model can understand simple prompts; however, as seen from the figure in the paper, it seems to do a far better job when guided further with some input points.

I'm pretty sure the team at Meta is working to fix these shortcomings, and I just can't wait to see the marriage between multiple modalities like text, image, and speech. While we wait for many more AI breakthroughs and sophisticated papers to come, I will let you digest all the information about SAM in this video, and I will see you in my next one. Take care, bye!
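To make the instance-segmentation transfer described above concrete, here is a minimal sketch of feeding detector boxes to SAM as box prompts, using the released segment-anything package. The boxes are hard-coded placeholders standing in for the output of a detector such as ViTDet, and XYXY pixel coordinates are assumed.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # placeholder path
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("street.jpg"), cv2.COLOR_BGR2RGB)  # placeholder image
predictor.set_image(image)  # embed the image once

# Placeholder boxes standing in for ViTDet detections, in XYXY pixel coordinates.
detector_boxes = np.array([
    [100, 80, 220, 300],
    [250, 90, 360, 310],
])

instance_masks = []
for box in detector_boxes:
    # One box prompt -> one mask; multimask_output=False keeps the single best mask.
    mask, score, _ = predictor.predict(box=box, multimask_output=False)
    instance_masks.append(mask[0])

print(len(instance_masks), "instance masks")
```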
Info
Channel: AI Bites
Views: 12,265
Keywords: machinelearning, deeplearning, transformers, artificial intelligence, AI, deep learning, machine learning, educational, how to learn AI
Id: qa3uK3Ewd9Q
Length: 16min 17sec (977 seconds)
Published: Tue Apr 18 2023